Introduction

The goal of the midterm is to apply some of the methods for supervised and unsupervised analysis to a new dataset. We will work with data characterizing the relationship between wine quality and its analytical characteristics, available at the UCI ML repository as well as on this course website on Canvas. The overall goal is to use data modeling approaches to understand which wine properties most influence wine quality as determined by expert evaluation. The output variable assigns wine to discrete categories between 0 (the worst) and 10 (the best), so this problem can be formulated as classification or regression; here we will stick to the latter and model the outcome as a continuous variable. For more details, please see the dataset description available at UCI ML or the corresponding file on this course website on Canvas. Please note that there is another, much smaller, dataset on UCI ML that also characterizes wine in terms of its analytical properties. Make sure to use the correct URL as shown above or, to eliminate any ambiguity, the data available on the course website on Canvas; the correct dataset contains several thousand observations. For simplicity, clarity and to decrease your dependency on network reliability and UCI ML availability, you are advised to download the data made available on this course website to a local folder and work with this local copy.

Two compilations of data are available under the URL shown above as well as on the course website on Canvas, one for red and one for white wine. Please develop models of wine quality for each of them, investigate the attributes deemed important for wine quality in both, and determine whether the quality of red and white wine is influenced predominantly by the same or by different analytical properties (i.e. predictors in these datasets). Lastly, as an exercise in unsupervised learning, you will be asked to combine the analytical data for red and white wine and describe the structure of the resulting data: whether there are any well-defined clusters, what subsets of observations they appear to represent, which attributes seem to affect this structure the most, etc.

Finally, as you will notice, the instructions here are terser than in the previous homework assignments. We expect you to use what you've learned in class to complete the analysis and draw appropriate conclusions based on the data. All approaches that you are expected to apply here have been exercised in the preceding weekly assignments; please feel free to consult your submissions and/or official solutions to see how they have been applied to different datasets. As always, if something appears to be unclear, please ask questions. We may, as we see fit, change to private mode those that in our opinion reveal too many details.

Sub-problem 1: load and summarize the data (20 points)

Download and read in the data, produce numerical and graphical summaries of the dataset attributes, decide whether they can be used for modeling in untransformed form or any transformations are justified, comment on correlation structure and whether some of the predictors suggest relationship with the outcome.

Answer:

After briefly going through the following links [http://onlinelibrary.wiley.com/doi/10.1002/9781118730720.fmatter/pdf], [http://winefolly.com/review/understanding-acidity-in-wine/] and other literature online, and with some basic domain knowledge, we can hypothesize that the following attributes mainly affect the quality of wine: 1. acidity (fixed acidity, volatile acidity, citric acid, etc.); 2. sugar level (residual sugar); 3. pH, which is itself a measure of acidity; 4. alcohol level. We will first look at the data and remove null values, then analyze each attribute on its own and how it compares to quality, and finally perform pairwise analysis.

#Read sample data
#wr- red wine
#ww- white wine
library(MASS)     # provides truehist()
library(ggplot2)  # provides ggplot()
setwd("/Users/RaviRani/Documents/Harvard-Extension/CSCI E-63/midterm")
wr<-read.table("winequality-red.csv",sep=";",header=TRUE)
ww<-read.table("winequality-white.csv",sep=";",header=TRUE)
#head of red wine & white wine
head(wr)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
head(ww)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.0             0.27        0.36           20.7     0.045
## 2           6.3             0.30        0.34            1.6     0.049
## 3           8.1             0.28        0.40            6.9     0.050
## 4           7.2             0.23        0.32            8.5     0.058
## 5           7.2             0.23        0.32            8.5     0.058
## 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6
# column names of white & red wine
colnames(wr)
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
colnames(ww)
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
# Dimension of red & white wines before removing null values
#dim(wr)
#dim(ww)
#convert to data frame
# Created variables for log and sqrt transformation
dfwr<-as.data.frame.matrix(wr) 
dfww<-as.data.frame.matrix(ww)

logdfwr<-as.data.frame.matrix(wr) 
logdfww<-as.data.frame.matrix(ww)

sqrtdfwr<-as.data.frame.matrix(wr) 
sqrtdfww<-as.data.frame.matrix(ww)
# check for null values  for both wines
sum(is.na(dfwr))
## [1] 0
dfwr<-na.omit(dfwr)

sum(is.na(dfww))
## [1] 0
dfww<-na.omit(dfww)

dim(dfwr)
## [1] 1599   12
dim(dfww)
## [1] 4898   12
# check for null is done
# untransformed
summary(dfwr)
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
# drawing distribution of all attributes for red wine
par(mfrow=c(2,2), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
barplot((table(dfwr$quality)), col=c("DeepSkyBlue4", "DeepSkyBlue", "DeepSkyBlue1", "DeepSkyBlue2", "DeepSkyBlue3", "DeepSkyBlue4"))
mtext("Quality", side=1, outer=F, line=2, cex=0.8)
truehist(dfwr$fixed.acidity, h = 0.5, col="DeepSkyBlue")
mtext("Fixed Acidity", side=1, outer=F, line=2, cex=0.8)
truehist(dfwr$volatile.acidity, h = 0.05, col="DeepSkyBlue")
mtext("Volatile Acidity", side=1, outer=F, line=2, cex=0.8)
truehist(dfwr$citric.acid, h = 0.1, col="DeepSkyBlue")
mtext("Citric Acid", side=1, outer=F, line=2, cex=0.8)

par(mfrow=c(2,2), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
truehist(dfwr$residual.sugar, h = 0.5, col="DeepSkyBlue")
mtext("Residual Sugar", side=1, outer=F, line=2, cex=0.8)
truehist(dfwr$chlorides, h = 0.01, col="DeepSkyBlue")
mtext("chlorides", side=1, outer=F, line=2, cex=0.8)
# bin widths (h) are scaled to the range of each attribute
truehist(dfwr$free.sulfur.dioxide, h = 5, col="DeepSkyBlue")
mtext("free.sulfur.dioxide", side=1, outer=F, line=2, cex=0.8)
truehist(dfwr$total.sulfur.dioxide, h = 10, col="DeepSkyBlue")
mtext("total.sulfur.dioxide", side=1, outer=F, line=2, cex=0.8)

par(mfrow=c(2,2), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
truehist(dfwr$density, h = 0.0005, col="DeepSkyBlue")
mtext("Density", side=1, outer=F, line=2, cex=0.8)
truehist(dfwr$pH, h = 0.1, col="DeepSkyBlue")
mtext("pH", side=1, outer=F, line=2, cex=0.8)
truehist(dfwr$sulphates, h = 0.05, col="DeepSkyBlue")
mtext("Sulphates", side=1, outer=F, line=2, cex=0.8)
truehist(dfwr$alcohol, h = 0.1, col="DeepSkyBlue")
mtext("alcohol", side=1, outer=F, line=2, cex=0.8)

summary(dfww)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000
# drawing distribution of all attributes for white wine
par(mfrow=c(2,2), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
barplot((table(dfww$quality)), col=c("DeepSkyBlue4", "DeepSkyBlue", "DeepSkyBlue1", "DeepSkyBlue2", "DeepSkyBlue3", "DeepSkyBlue4"))
mtext("Quality", side=1, outer=F, line=2, cex=0.8)
truehist(dfww$fixed.acidity, h = 0.5, col="DeepSkyBlue")
mtext("Fixed Acidity", side=1, outer=F, line=2, cex=0.8)
truehist(dfww$volatile.acidity, h = 0.05, col="DeepSkyBlue")
mtext("Volatile Acidity", side=1, outer=F, line=2, cex=0.8)
truehist(dfww$citric.acid, h = 0.1, col="DeepSkyBlue")
mtext("Citric Acid", side=1, outer=F, line=2, cex=0.8)

par(mfrow=c(2,2), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
truehist(dfww$residual.sugar, h = 0.5, col="DeepSkyBlue")
mtext("Residual Sugar", side=1, outer=F, line=2, cex=0.8)
truehist(dfww$chlorides, h = 0.01, col="DeepSkyBlue")
mtext("chlorides", side=1, outer=F, line=2, cex=0.8)
# bin widths (h) are scaled to the range of each attribute
truehist(dfww$free.sulfur.dioxide, h = 5, col="DeepSkyBlue")
mtext("free.sulfur.dioxide", side=1, outer=F, line=2, cex=0.8)
truehist(dfww$total.sulfur.dioxide, h = 10, col="DeepSkyBlue")
mtext("total.sulfur.dioxide", side=1, outer=F, line=2, cex=0.8)

par(mfrow=c(2,2), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
truehist(dfww$density, h = 0.0005, col="DeepSkyBlue")
mtext("Density", side=1, outer=F, line=2, cex=0.8)
truehist(dfww$pH, h = 0.1, col="DeepSkyBlue")
mtext("pH", side=1, outer=F, line=2, cex=0.8)
truehist(dfww$sulphates, h = 0.05, col="DeepSkyBlue")
mtext("Sulphates", side=1, outer=F, line=2, cex=0.8)
truehist(dfww$alcohol, h = 0.1, col="DeepSkyBlue")
mtext("alcohol", side=1, outer=F, line=2, cex=0.8)

Analysis of single attributes of red wine

1. From the summary data and histograms, quality is approximately normally distributed, with most values equal to 5 or 6. Fixed and volatile acidity are also roughly normal. Citric acid is closer to uniform, with a peak at the lower end.

Residual sugar is roughly normal in its central part but right-skewed. Sulphates and the SO2 attributes show the same pattern.

pH and density are approximately normally distributed. Alcohol is not normally distributed.

2. Analysis of single attributes of white wine

Quality, fixed acidity, volatile acidity and citric acid behave as for red wine. Residual sugar and chlorides are right-skewed. The SO2 values are approximately normal. Density seems to have many outliers. pH is approximately normally distributed.
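The visual impression of skewness can be backed by a number. The following is a minimal sketch, not part of the original analysis: `skewness` and `skew_table` are hypothetical helper names, the choice of `log1p` as the candidate transformation is an assumption, and the `demo` data frame is synthetic so that the block runs without the wine files.

```r
# Sample skewness (third standardized moment); base R has no built-in for it.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))
  mean((x - m)^3) / s^3
}

# Skewness of every predictor before and after log(1 + x).
# `outcome` names the column to leave out; predictors are assumed non-negative.
skew_table <- function(df, outcome = "quality") {
  predictors <- setdiff(colnames(df), outcome)
  data.frame(
    attribute = predictors,
    raw = sapply(predictors, function(v) skewness(df[[v]])),
    log1p = sapply(predictors, function(v) skewness(log1p(df[[v]]))),
    row.names = NULL
  )
}

# Synthetic stand-in for the wine data (made-up values, not the real dataset)
set.seed(1)
demo <- data.frame(
  residual.sugar = rlnorm(500, meanlog = 1, sdlog = 0.8),  # right-skewed
  pH = rnorm(500, mean = 3.3, sd = 0.15),                  # roughly symmetric
  quality = sample(3:8, 500, replace = TRUE)
)
print(skew_table(demo))
```

On the real data the same call would be `skew_table(dfwr)` or `skew_table(dfww)`. Attributes whose skewness moves much closer to zero after the transformation are the likelier candidates for a log transform.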

#Boxplots of attributes for red wine
par(mfrow=c(1,6), oma = c(1,1,0,0) + 0.1,  mar = c(3,3,1,1) + 0.1)
boxplot(dfwr$fixed.acidity,  pch=19)
mtext("Fixed Acidity", cex=0.8, side=1, line=2)
boxplot(dfwr$volatile.acidity,  pch=19)
mtext("volatile.acidity", cex=0.8, side=1, line=2)
boxplot(dfwr$citric.acid,  pch=19)
mtext("citric.acid", cex=0.8, side=1, line=2)
boxplot(dfwr$residual.sugar,  pch=19)
mtext("residual.sugar", cex=0.8, side=1, line=2)
boxplot(dfwr$chlorides,  pch=19)
mtext("chlorides", cex=0.8, side=1, line=2)
boxplot(dfwr$free.sulfur.dioxide,  pch=19)
mtext("free.sulfur.dioxide", cex=0.8, side=1, line=2)

par(mfrow=c(1,5), oma = c(1,1,0,0) + 0.1,  mar = c(3,3,1,1) + 0.1)
boxplot(dfwr$total.sulfur.dioxide,  pch=19)
mtext("total.sulfur.dioxide", cex=0.8, side=1, line=2)
boxplot(dfwr$density,  pch=19)
mtext("Density", cex=0.8, side=1, line=2)
boxplot(dfwr$pH,  pch=19)
mtext("PH", cex=0.8, side=1, line=2)
boxplot(dfwr$sulphates,  pch=19)
mtext("Sulphates", cex=0.8, side=1, line=2)
boxplot(dfwr$alcohol,  pch=19)
mtext("Alcohol", cex=0.8, side=1, line=2)

boxplot(dfwr$quality,  pch=19)
mtext("Quality", cex=0.8, side=1, line=2)

#Boxplots of attributes for white wine
par(mfrow=c(1,6), oma = c(1,1,0,0) + 0.1,  mar = c(3,3,1,1) + 0.1)

boxplot(dfww$fixed.acidity,  pch=19)
mtext("Fixed Acidity", cex=0.8, side=1, line=2)
boxplot(dfww$volatile.acidity,  pch=19)
mtext("volatile.acidity", cex=0.8, side=1, line=2)
boxplot(dfww$citric.acid,  pch=19)
mtext("citric.acid", cex=0.8, side=1, line=2)
boxplot(dfww$residual.sugar,  pch=19)
mtext("residual.sugar", cex=0.8, side=1, line=2)
boxplot(dfww$chlorides,  pch=19)
mtext("chlorides", cex=0.8, side=1, line=2)
boxplot(dfww$free.sulfur.dioxide,  pch=19)
mtext("free.sulfur.dioxide", cex=0.8, side=1, line=2)

par(mfrow=c(1,5), oma = c(1,1,0,0) + 0.1,  mar = c(3,3,1,1) + 0.1)
boxplot(dfww$total.sulfur.dioxide,  pch=19)
mtext("total.sulfur.dioxide", cex=0.8, side=1, line=2)
boxplot(dfww$density,  pch=19)
mtext("Density", cex=0.8, side=1, line=2)
boxplot(dfww$pH,  pch=19)
mtext("PH", cex=0.8, side=1, line=2)
boxplot(dfww$sulphates,  pch=19)
mtext("Sulphates", cex=0.8, side=1, line=2)
boxplot(dfww$alcohol,  pch=19)
mtext("Alcohol", cex=0.8, side=1, line=2)

boxplot(dfww$quality,  pch=19)
mtext("Quality", cex=0.8, side=1, line=2)

The box plots above show outliers in almost all of the attributes.
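The number of points a box plot draws as outliers can also be counted directly. This is a rough sketch under the usual 1.5 * IQR whisker rule, which is what `boxplot()` uses by default. `count_outliers` is a hypothetical helper name and the `demo` data are synthetic.

```r
# Count observations outside the boxplot whiskers (1.5 * IQR rule) per column.
count_outliers <- function(df) {
  sapply(df, function(x) length(boxplot.stats(x)$out))
}

# Synthetic example with two planted extreme chloride values
set.seed(2)
demo <- data.frame(
  chlorides = c(rnorm(200, 0.08, 0.01), 0.4, 0.6),
  pH = rnorm(202, 3.3, 0.15)
)
print(count_outliers(demo))
```

Applied as `count_outliers(dfwr)` or `count_outliers(dfww)`, this gives one count per attribute, which makes it easier to compare the two wines than reading the plots alone.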

#correlations
signif(cor(wr[,colnames(wr)]),3)
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity               1.0000         -0.25600      0.6720
## volatile.acidity           -0.2560          1.00000     -0.5520
## citric.acid                 0.6720         -0.55200      1.0000
## residual.sugar              0.1150          0.00192      0.1440
## chlorides                   0.0937          0.06130      0.2040
## free.sulfur.dioxide        -0.1540         -0.01050     -0.0610
## total.sulfur.dioxide       -0.1130          0.07650      0.0355
## density                     0.6680          0.02200      0.3650
## pH                         -0.6830          0.23500     -0.5420
## sulphates                   0.1830         -0.26100      0.3130
## alcohol                    -0.0617         -0.20200      0.1100
## quality                     0.1240         -0.39100      0.2260
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity               0.11500   0.09370            -0.15400
## volatile.acidity            0.00192   0.06130            -0.01050
## citric.acid                 0.14400   0.20400            -0.06100
## residual.sugar              1.00000   0.05560             0.18700
## chlorides                   0.05560   1.00000             0.00556
## free.sulfur.dioxide         0.18700   0.00556             1.00000
## total.sulfur.dioxide        0.20300   0.04740             0.66800
## density                     0.35500   0.20100            -0.02190
## pH                         -0.08570  -0.26500             0.07040
## sulphates                   0.00553   0.37100             0.05170
## alcohol                     0.04210  -0.22100            -0.06940
## quality                     0.01370  -0.12900            -0.05070
##                      total.sulfur.dioxide density      pH sulphates
## fixed.acidity                     -0.1130  0.6680 -0.6830   0.18300
## volatile.acidity                   0.0765  0.0220  0.2350  -0.26100
## citric.acid                        0.0355  0.3650 -0.5420   0.31300
## residual.sugar                     0.2030  0.3550 -0.0857   0.00553
## chlorides                          0.0474  0.2010 -0.2650   0.37100
## free.sulfur.dioxide                0.6680 -0.0219  0.0704   0.05170
## total.sulfur.dioxide               1.0000  0.0713 -0.0665   0.04290
## density                            0.0713  1.0000 -0.3420   0.14900
## pH                                -0.0665 -0.3420  1.0000  -0.19700
## sulphates                          0.0429  0.1490 -0.1970   1.00000
## alcohol                           -0.2060 -0.4960  0.2060   0.09360
## quality                           -0.1850 -0.1750 -0.0577   0.25100
##                      alcohol quality
## fixed.acidity        -0.0617  0.1240
## volatile.acidity     -0.2020 -0.3910
## citric.acid           0.1100  0.2260
## residual.sugar        0.0421  0.0137
## chlorides            -0.2210 -0.1290
## free.sulfur.dioxide  -0.0694 -0.0507
## total.sulfur.dioxide -0.2060 -0.1850
## density              -0.4960 -0.1750
## pH                    0.2060 -0.0577
## sulphates             0.0936  0.2510
## alcohol               1.0000  0.4760
## quality               0.4760  1.0000
signif(cor(ww[,colnames(ww)]),3)
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity               1.0000          -0.0227     0.28900
## volatile.acidity           -0.0227           1.0000    -0.14900
## citric.acid                 0.2890          -0.1490     1.00000
## residual.sugar              0.0890           0.0643     0.09420
## chlorides                   0.0231           0.0705     0.11400
## free.sulfur.dioxide        -0.0494          -0.0970     0.09410
## total.sulfur.dioxide        0.0911           0.0893     0.12100
## density                     0.2650           0.0271     0.15000
## pH                         -0.4260          -0.0319    -0.16400
## sulphates                  -0.0171          -0.0357     0.06230
## alcohol                    -0.1210           0.0677    -0.07570
## quality                    -0.1140          -0.1950    -0.00921
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                0.0890    0.0231           -0.049400
## volatile.acidity             0.0643    0.0705           -0.097000
## citric.acid                  0.0942    0.1140            0.094100
## residual.sugar               1.0000    0.0887            0.299000
## chlorides                    0.0887    1.0000            0.101000
## free.sulfur.dioxide          0.2990    0.1010            1.000000
## total.sulfur.dioxide         0.4010    0.1990            0.616000
## density                      0.8390    0.2570            0.294000
## pH                          -0.1940   -0.0904           -0.000618
## sulphates                   -0.0267    0.0168            0.059200
## alcohol                     -0.4510   -0.3600           -0.250000
## quality                     -0.0976   -0.2100            0.008160
##                      total.sulfur.dioxide density        pH sulphates
## fixed.acidity                     0.09110  0.2650 -0.426000   -0.0171
## volatile.acidity                  0.08930  0.0271 -0.031900   -0.0357
## citric.acid                       0.12100  0.1500 -0.164000    0.0623
## residual.sugar                    0.40100  0.8390 -0.194000   -0.0267
## chlorides                         0.19900  0.2570 -0.090400    0.0168
## free.sulfur.dioxide               0.61600  0.2940 -0.000618    0.0592
## total.sulfur.dioxide              1.00000  0.5300  0.002320    0.1350
## density                           0.53000  1.0000 -0.093600    0.0745
## pH                                0.00232 -0.0936  1.000000    0.1560
## sulphates                         0.13500  0.0745  0.156000    1.0000
## alcohol                          -0.44900 -0.7800  0.121000   -0.0174
## quality                          -0.17500 -0.3070  0.099400    0.0537
##                      alcohol  quality
## fixed.acidity        -0.1210 -0.11400
## volatile.acidity      0.0677 -0.19500
## citric.acid          -0.0757 -0.00921
## residual.sugar       -0.4510 -0.09760
## chlorides            -0.3600 -0.21000
## free.sulfur.dioxide  -0.2500  0.00816
## total.sulfur.dioxide -0.4490 -0.17500
## density              -0.7800 -0.30700
## pH                    0.1210  0.09940
## sulphates            -0.0174  0.05370
## alcohol               1.0000  0.43600
## quality               0.4360  1.00000
Analysis of correlation of attributes for red wine

1. Fixed acidity is strongly correlated with citric acid (0.67), which seems natural since citric acid is itself an acid. 2. Fixed acidity is also strongly correlated with density (0.67). 3. Fixed acidity is negatively correlated with pH (-0.68), which is expected because more acidic solutions have lower pH values. 4. The variables most strongly correlated with quality are alcohol (0.48) and volatile acidity (-0.39); citric acid (0.23) and sulphates (0.25) show weaker correlations. 5. Alcohol is negatively correlated with density (-0.50).
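For reference, pH is defined as the negative base-10 logarithm of the hydrogen ion concentration, so a more acidic solution has a lower pH and a negative correlation between acidity and pH is the expected direction. A tiny illustration, with made-up concentrations that are not taken from the wine data:

```r
# pH = -log10([H+]); the concentrations below are illustrative values in mol/L
h_conc <- c(1e-4, 5e-4, 1e-3)   # increasing acidity
pH <- -log10(h_conc)
print(round(pH, 2))

# pH falls as acidity rises, so the correlation is negative
print(cor(h_conc, pH) < 0)
```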

Analysis of correlation of attributes for white wine

1. Fixed acidity is only weakly correlated with citric acid (0.29), unlike in red wine. 2. The correlation between density and fixed acidity is also weak (0.27); density is instead strongly correlated with residual sugar (0.84). 3. Fixed acidity is negatively correlated with pH (-0.43), which is expected because more acidic solutions have lower pH values. 4. The variables most strongly correlated with quality are alcohol (0.44), density (-0.31) and chlorides (-0.21); citric acid (-0.01) and sulphates (0.05) show essentially no correlation with quality. 5. Alcohol is strongly negatively correlated with density (-0.78).
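Reading the strongest relationships off a 12 x 12 matrix is error-prone, so it may help to rank the predictors by the absolute value of their correlation with the outcome. The sketch below is one way to do that. `rank_by_cor` is a hypothetical helper name, and the `demo` data frame is simulated so the example does not depend on the wine files.

```r
# Correlation of each predictor with the outcome, ordered by absolute value.
rank_by_cor <- function(df, outcome = "quality") {
  predictors <- setdiff(colnames(df), outcome)
  r <- sapply(predictors, function(v) cor(df[[v]], df[[outcome]]))
  r[order(abs(r), decreasing = TRUE)]
}

# Simulated data: quality depends on alcohol (+) and volatile acidity (-)
set.seed(3)
n <- 300
alcohol <- rnorm(n, 10.5, 1)
volatile.acidity <- rnorm(n, 0.5, 0.15)
noise <- rnorm(n)
demo <- data.frame(
  alcohol, volatile.acidity, noise,
  quality = 0.5 * alcohol - 2 * volatile.acidity + rnorm(n, sd = 0.5)
)
print(signif(rank_by_cor(demo), 3))
```

Calling `rank_by_cor(wr)` and `rank_by_cor(ww)` side by side would show directly whether the same attributes lead for both wines.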

Now creating box plots of red wine attributes against quality to see how they trend

# quality is discrete, so it is converted to a factor to get one box per quality level
# (a continuous x would give a single box and the aes(group=...) warning)
for (v in setdiff(colnames(dfwr), "quality")) {
  p <- ggplot(data = dfwr, aes(x = factor(quality), y = .data[[v]])) +
    geom_jitter(alpha = .3) +
    geom_boxplot(alpha = .1, color = 'blue') +
    stat_summary(fun = mean, geom = "point", color = "red", shape = 8, size = 4) +
    labs(x = "quality", y = v)
  print(p)
}

Now creating box plots of white wine attributes against quality to see how they trend

# quality is discrete, so it is converted to a factor to get one box per quality level
# (a continuous x would give a single box and the aes(group=...) warning)
for (v in setdiff(colnames(dfww), "quality")) {
  p <- ggplot(data = dfww, aes(x = factor(quality), y = .data[[v]])) +
    geom_jitter(alpha = .3) +
    geom_boxplot(alpha = .1, color = 'blue') +
    stat_summary(fun = mean, geom = "point", color = "red", shape = 8, size = 4) +
    labs(x = "quality", y = v)
  print(p)
}

Red wine analysis

Fixed acidity has almost no effect on quality.

Volatile acidity seems to have a negative impact on quality.

Higher citric acid is associated with better quality wine.

Residual sugar has no visible effect on quality.

Chlorides are only weakly correlated with quality; lower chloride values go with better quality wines.

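Higher values of free SO2 appear to go with better wine than lower values, although the correlation with quality is close to zero (-0.05).

Total SO2 shows a similar pattern in the plot, although its correlation with quality is weakly negative (-0.19).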

Density clearly affects quality, with a negative relationship.

pH also affects quality: lower pH values go with better quality, although quality decreases when pH is very low.

Sulphates and alcohol are positively correlated with quality; both increase with quality.

White wine analysis

All attributes show the same behavior as for red wine, except the following:

Volatile acidity has no effect on quality.

Similarly, citric acid has no effect.

pH has a weak relationship with white wine quality.

Sulphates have no effect on quality.
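The visual impressions listed above can be cross-checked numerically. The following is only a sketch on simulated data: the data frame `toy`, its two columns and the simulated effect sizes are assumptions made for illustration and are not taken from the wine datasets. It computes the Spearman rank correlation of each attribute with quality and the per-quality-level means (the quantity marked by the red stars in the plots).

```r
# Simulated stand-in for a wine table; `toy` and its values are hypothetical.
set.seed(1)
n <- 500
toy <- data.frame(alcohol        = rnorm(n, mean = 10.5, sd = 1),
                  residual.sugar = rexp(n, rate = 0.4))
# Integer quality score driven by alcohol only, clipped to the range 3..8.
toy$quality <- pmin(8, pmax(3, round(5.6 + 0.5 * (toy$alcohol - 10.5) +
                                     rnorm(n, sd = 0.6))))

# Spearman rank correlation of every attribute with quality.
predictors <- setdiff(names(toy), "quality")
rho <- sapply(toy[predictors],
              function(x) cor(x, toy$quality, method = "spearman"))
print(round(sort(rho, decreasing = TRUE), 3))

# Mean of each attribute within each quality level.
group_means <- aggregate(toy[predictors],
                         by = list(quality = toy$quality), FUN = mean)
print(group_means)
```

With the real data, the same two calls could be applied to `dfwr` and `dfww` in place of `toy`; an attribute whose group means barely change across quality levels corresponds to the "no effect" observations above.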

#Pairs of untransformed attributes
pairs(dfwr);

pairs(dfww);

#summary of untransformed linear regression
mwr<-lm(quality~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+pH+sulphates+alcohol,dfwr)
summary(mwr)
## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     pH + sulphates + alcohol, data = dfwr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.67204 -0.36527 -0.04523  0.45628  2.03894 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.4538341  0.6125783   7.271 5.59e-13 ***
## fixed.acidity         0.0081441  0.0160586   0.507  0.61212    
## volatile.acidity     -1.0964449  0.1200866  -9.130  < 2e-16 ***
## citric.acid          -0.1836098  0.1471561  -1.248  0.21232    
## residual.sugar        0.0089507  0.0120542   0.743  0.45787    
## chlorides            -1.9067341  0.4173928  -4.568 5.30e-06 ***
## free.sulfur.dioxide   0.0045147  0.0021631   2.087  0.03704 *  
## total.sulfur.dioxide -0.0033120  0.0007264  -4.560 5.52e-06 ***
## pH                   -0.5042762  0.1571117  -3.210  0.00136 ** 
## sulphates             0.8928974  0.1107548   8.062 1.46e-15 ***
## alcohol               0.2927427  0.0173394  16.883  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6479 on 1588 degrees of freedom
## Multiple R-squared:  0.3603, Adjusted R-squared:  0.3562 
## F-statistic: 89.43 on 10 and 1588 DF,  p-value: < 2.2e-16
mww<-lm(quality~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+pH+sulphates+alcohol,dfww)
summary(mww)
## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     pH + sulphates + alcohol, data = dfww)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9098 -0.4957 -0.0330  0.4666  3.1785 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.0636371  0.3482321   5.926 3.32e-09 ***
## fixed.acidity        -0.0503197  0.0149092  -3.375 0.000744 ***
## volatile.acidity     -1.9583442  0.1138553 -17.200  < 2e-16 ***
## citric.acid          -0.0289483  0.0961455  -0.301 0.763360    
## residual.sugar        0.0256438  0.0025518  10.049  < 2e-16 ***
## chlorides            -0.9525303  0.5425208  -1.756 0.079194 .  
## free.sulfur.dioxide   0.0047672  0.0008391   5.682 1.41e-08 ***
## total.sulfur.dioxide -0.0008697  0.0003730  -2.331 0.019771 *  
## pH                    0.1651688  0.0825418   2.001 0.045444 *  
## sulphates             0.4193440  0.0973099   4.309 1.67e-05 ***
## alcohol               0.3626941  0.0112672  32.190  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.756 on 4887 degrees of freedom
## Multiple R-squared:  0.2727, Adjusted R-squared:  0.2713 
## F-statistic: 183.3 on 10 and 4887 DF,  p-value: < 2.2e-16
# Log transformed
# summary of the log-transformed data, log(x + 1)
cols <- c("fixed.acidity","volatile.acidity","citric.acid","residual.sugar","chlorides","free.sulfur.dioxide","total.sulfur.dioxide","density","pH","sulphates","alcohol","quality")
logdfwr[cols] <- log(dfwr[cols]+1)
logdfww[cols] <- log(dfww[cols]+1)
summary(logdfwr)
##  fixed.acidity   volatile.acidity  citric.acid      residual.sugar  
##  Min.   :1.723   Min.   :0.1133   Min.   :0.00000   Min.   :0.6419  
##  1st Qu.:2.092   1st Qu.:0.3293   1st Qu.:0.08618   1st Qu.:1.0647  
##  Median :2.186   Median :0.4187   Median :0.23111   Median :1.1632  
##  Mean   :2.216   Mean   :0.4172   Mean   :0.22815   Mean   :1.2181  
##  3rd Qu.:2.322   3rd Qu.:0.4947   3rd Qu.:0.35066   3rd Qu.:1.2809  
##  Max.   :2.827   Max.   :0.9478   Max.   :0.69315   Max.   :2.8034  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01193   Min.   :0.6931      Min.   :1.946       
##  1st Qu.:0.06766   1st Qu.:2.0794      1st Qu.:3.135       
##  Median :0.07603   Median :2.7081      Median :3.664       
##  Mean   :0.08304   Mean   :2.6390      Mean   :3.635       
##  3rd Qu.:0.08618   3rd Qu.:3.0910      3rd Qu.:4.143       
##  Max.   :0.47686   Max.   :4.2905      Max.   :5.670       
##     density             pH          sulphates         alcohol     
##  Min.   :0.6882   Min.   :1.319   Min.   :0.2852   Min.   :2.241  
##  1st Qu.:0.6909   1st Qu.:1.437   1st Qu.:0.4383   1st Qu.:2.351  
##  Median :0.6915   Median :1.461   Median :0.4824   Median :2.416  
##  Mean   :0.6915   Mean   :1.461   Mean   :0.5011   Mean   :2.431  
##  3rd Qu.:0.6921   3rd Qu.:1.482   3rd Qu.:0.5481   3rd Qu.:2.493  
##  Max.   :0.6950   Max.   :1.611   Max.   :1.0986   Max.   :2.766  
##     quality     
##  Min.   :1.386  
##  1st Qu.:1.792  
##  Median :1.946  
##  Mean   :1.885  
##  3rd Qu.:1.946  
##  Max.   :2.197
summary(logdfww)
##  fixed.acidity   volatile.acidity   citric.acid     residual.sugar  
##  Min.   :1.569   Min.   :0.07696   Min.   :0.0000   Min.   :0.4700  
##  1st Qu.:1.988   1st Qu.:0.19062   1st Qu.:0.2390   1st Qu.:0.9933  
##  Median :2.054   Median :0.23111   Median :0.2776   Median :1.8245  
##  Mean   :2.055   Mean   :0.24257   Mean   :0.2844   Mean   :1.7522  
##  3rd Qu.:2.116   3rd Qu.:0.27763   3rd Qu.:0.3293   3rd Qu.:2.3888  
##  Max.   :2.721   Max.   :0.74194   Max.   :0.9783   Max.   :4.2017  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00896   Min.   :1.099       Min.   :2.303       
##  1st Qu.:0.03537   1st Qu.:3.178       1st Qu.:4.691       
##  Median :0.04210   Median :3.555       Median :4.905       
##  Mean   :0.04455   Mean   :3.472       Mean   :4.886       
##  3rd Qu.:0.04879   3rd Qu.:3.850       3rd Qu.:5.124       
##  Max.   :0.29714   Max.   :5.670       Max.   :6.089       
##     density             pH          sulphates         alcohol     
##  Min.   :0.6867   Min.   :1.314   Min.   :0.1989   Min.   :2.197  
##  1st Qu.:0.6890   1st Qu.:1.409   1st Qu.:0.3436   1st Qu.:2.351  
##  Median :0.6900   Median :1.430   Median :0.3853   Median :2.434  
##  Mean   :0.6902   Mean   :1.432   Mean   :0.3959   Mean   :2.438  
##  3rd Qu.:0.6912   3rd Qu.:1.454   3rd Qu.:0.4383   3rd Qu.:2.518  
##  Max.   :0.7124   Max.   :1.573   Max.   :0.7324   Max.   :2.721  
##     quality     
##  Min.   :1.386  
##  1st Qu.:1.792  
##  Median :1.946  
##  Mean   :1.920  
##  3rd Qu.:1.946  
##  Max.   :2.303
#correlations
signif(cor(logdfwr[,colnames(wr)]),3)
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                1.000          -0.2610      0.6620
## volatile.acidity            -0.261           1.0000     -0.5750
## citric.acid                  0.662          -0.5750      1.0000
## residual.sugar               0.159           0.0242      0.1640
## chlorides                    0.120           0.0726      0.1890
## free.sulfur.dioxide         -0.178           0.0207     -0.0796
## total.sulfur.dioxide        -0.114           0.0841      0.0128
## density                      0.674           0.0300      0.3590
## pH                          -0.704           0.2320     -0.5440
## sulphates                    0.191          -0.2830      0.3200
## alcohol                     -0.090          -0.2140      0.0997
## quality                      0.113          -0.3960      0.2200
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                0.1590   0.12000            -0.17800
## volatile.acidity             0.0242   0.07260             0.02070
## citric.acid                  0.1640   0.18900            -0.07960
## residual.sugar               1.0000   0.05590             0.10000
## chlorides                    0.0559   1.00000            -0.00557
## free.sulfur.dioxide          0.1000  -0.00557             1.00000
## total.sulfur.dioxide         0.1540   0.06220             0.78400
## density                      0.4060   0.21900            -0.03960
## pH                          -0.0896  -0.27300             0.09580
## sulphates                    0.0156   0.33800             0.05530
## alcohol                      0.0751  -0.23600            -0.08320
## quality                      0.0173  -0.13400            -0.03870
##                      total.sulfur.dioxide density      pH sulphates
## fixed.acidity                     -0.1140  0.6740 -0.7040    0.1910
## volatile.acidity                   0.0841  0.0300  0.2320   -0.2830
## citric.acid                        0.0128  0.3590 -0.5440    0.3200
## residual.sugar                     0.1540  0.4060 -0.0896    0.0156
## chlorides                          0.0622  0.2190 -0.2730    0.3380
## free.sulfur.dioxide                0.7840 -0.0396  0.0958    0.0553
## total.sulfur.dioxide               1.0000  0.1040 -0.0171    0.0593
## density                            0.1040  1.0000 -0.3410    0.1570
## pH                                -0.0171 -0.3410  1.0000   -0.1840
## sulphates                          0.0593  0.1570 -0.1840    1.0000
## alcohol                           -0.2370 -0.4920  0.2030    0.1150
## quality                           -0.1550 -0.1670 -0.0603    0.2760
##                      alcohol quality
## fixed.acidity        -0.0900  0.1130
## volatile.acidity     -0.2140 -0.3960
## citric.acid           0.0997  0.2200
## residual.sugar        0.0751  0.0173
## chlorides            -0.2360 -0.1340
## free.sulfur.dioxide  -0.0832 -0.0387
## total.sulfur.dioxide -0.2370 -0.1550
## density              -0.4920 -0.1670
## pH                    0.2030 -0.0603
## sulphates             0.1150  0.2760
## alcohol               1.0000  0.4590
## quality               0.4590  1.0000
signif(cor(logdfww[,colnames(ww)]),3)
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity               1.0000          -0.0306      0.3040
## volatile.acidity           -0.0306           1.0000     -0.1710
## citric.acid                 0.3040          -0.1710      1.0000
## residual.sugar              0.0874           0.0925      0.0710
## chlorides                   0.0341           0.0682      0.1070
## free.sulfur.dioxide        -0.0465          -0.1130      0.0869
## total.sulfur.dioxide        0.0849           0.0719      0.1150
## density                     0.2760           0.0253      0.1460
## pH                         -0.4350          -0.0346     -0.1660
## sulphates                  -0.0153          -0.0373      0.0672
## alcohol                    -0.1250           0.0577     -0.0689
## quality                    -0.1120          -0.2090      0.0100
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                0.0874    0.0341             -0.0465
## volatile.acidity             0.0925    0.0682             -0.1130
## citric.acid                  0.0710    0.1070              0.0869
## residual.sugar               1.0000    0.0836              0.3260
## chlorides                    0.0836    1.0000              0.0957
## free.sulfur.dioxide          0.3260    0.0957              1.0000
## total.sulfur.dioxide         0.4090    0.2040              0.6290
## density                      0.7780    0.2690              0.2850
## pH                          -0.1840   -0.0906              0.0217
## sulphates                   -0.0324    0.0237              0.0631
## alcohol                     -0.4250   -0.3750             -0.2310
## quality                     -0.0686   -0.2120              0.1050
##                      total.sulfur.dioxide density      pH sulphates
## fixed.acidity                      0.0849  0.2760 -0.4350   -0.0153
## volatile.acidity                   0.0719  0.0253 -0.0346   -0.0373
## citric.acid                        0.1150  0.1460 -0.1660    0.0672
## residual.sugar                     0.4090  0.7780 -0.1840   -0.0324
## chlorides                          0.2040  0.2690 -0.0906    0.0237
## free.sulfur.dioxide                0.6290  0.2850  0.0217    0.0631
## total.sulfur.dioxide               1.0000  0.5060  0.0179    0.1410
## density                            0.5060  1.0000 -0.0948    0.0823
## pH                                 0.0179 -0.0948  1.0000    0.1580
## sulphates                          0.1410  0.0823  0.1580    1.0000
## alcohol                           -0.4300 -0.7860  0.1290   -0.0252
## quality                           -0.1160 -0.2980  0.0957    0.0500
##                      alcohol quality
## fixed.acidity        -0.1250 -0.1120
## volatile.acidity      0.0577 -0.2090
## citric.acid          -0.0689  0.0100
## residual.sugar       -0.4250 -0.0686
## chlorides            -0.3750 -0.2120
## free.sulfur.dioxide  -0.2310  0.1050
## total.sulfur.dioxide -0.4300 -0.1160
## density              -0.7860 -0.2980
## pH                    0.1290  0.0957
## sulphates            -0.0252  0.0500
## alcohol               1.0000  0.4200
## quality               0.4200  1.0000
mwr<-lm(quality~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+alcohol,logdfwr)
summary(mwr)
## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol, data = logdfwr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51177 -0.05083 -0.00499  0.06926  0.27889 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           5.327371   4.749294   1.122  0.26215    
## fixed.acidity         0.058952   0.039964   1.475  0.14038    
## volatile.acidity     -0.274983   0.029618  -9.284  < 2e-16 ***
## citric.acid          -0.063666   0.028694  -2.219  0.02664 *  
## residual.sugar        0.010098   0.012707   0.795  0.42694    
## chlorides            -0.315844   0.076437  -4.132 3.78e-05 ***
## free.sulfur.dioxide   0.016154   0.006785   2.381  0.01738 *  
## total.sulfur.dioxide -0.020126   0.006523  -3.085  0.00207 ** 
## density              -6.251071   7.025402  -0.890  0.37372    
## pH                   -0.228828   0.131154  -1.745  0.08123 .  
## sulphates             0.267088   0.032067   8.329  < 2e-16 ***
## alcohol               0.462217   0.049030   9.427  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09955 on 1587 degrees of freedom
## Multiple R-squared:  0.3468, Adjusted R-squared:  0.3423 
## F-statistic: 76.61 on 11 and 1587 DF,  p-value: < 2.2e-16
mww<-lm(quality~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+alcohol,logdfww)
summary(mww)
## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol, data = logdfww)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62344 -0.06828  0.00104  0.07211  0.47466 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           13.067607   2.509740   5.207 2.00e-07 ***
## fixed.acidity         -0.003205   0.020846  -0.154  0.87783    
## volatile.acidity      -0.379659   0.022403 -16.947  < 2e-16 ***
## citric.acid            0.014139   0.019555   0.723  0.46970    
## residual.sugar         0.045812   0.004989   9.183  < 2e-16 ***
## chlorides             -0.154266   0.087524  -1.763  0.07804 .  
## free.sulfur.dioxide    0.043017   0.004095  10.506  < 2e-16 ***
## total.sulfur.dioxide  -0.018855   0.007018  -2.686  0.00725 ** 
## density              -18.323801   3.642544  -5.030 5.07e-07 ***
## pH                     0.184960   0.057225   3.232  0.00124 ** 
## sulphates              0.113451   0.022213   5.107 3.39e-07 ***
## alcohol                0.472880   0.033085  14.293  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1102 on 4886 degrees of freedom
## Multiple R-squared:  0.288,  Adjusted R-squared:  0.2864 
## F-statistic: 179.7 on 11 and 4886 DF,  p-value: < 2.2e-16
#Pairs of log transformed
pairs(logdfwr);

pairs(logdfww);

# Square-root transformed, sqrt(x + 1)
cols <- c("fixed.acidity","volatile.acidity","citric.acid","residual.sugar","chlorides","free.sulfur.dioxide","total.sulfur.dioxide","density","pH","sulphates","alcohol","quality")
sqrtdfwr[cols] <- sqrt(dfwr[cols]+1)
sqrtdfww[cols] <- sqrt(dfww[cols]+1)
summary(sqrtdfwr)
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar 
##  Min.   :2.366   Min.   :1.058    Min.   :1.000   Min.   :1.378  
##  1st Qu.:2.846   1st Qu.:1.179    1st Qu.:1.044   1st Qu.:1.703  
##  Median :2.983   Median :1.233    Median :1.122   Median :1.789  
##  Mean   :3.040   Mean   :1.234    Mean   :1.124   Mean   :1.857  
##  3rd Qu.:3.194   3rd Qu.:1.281    3rd Qu.:1.192   3rd Qu.:1.897  
##  Max.   :4.111   Max.   :1.606    Max.   :1.414   Max.   :4.062  
##    chlorides     free.sulfur.dioxide total.sulfur.dioxide    density     
##  Min.   :1.006   Min.   :1.414       Min.   : 2.646       Min.   :1.411  
##  1st Qu.:1.034   1st Qu.:2.828       1st Qu.: 4.796       1st Qu.:1.413  
##  Median :1.039   Median :3.873       Median : 6.245       Median :1.413  
##  Mean   :1.043   Mean   :3.925       Mean   : 6.521       Mean   :1.413  
##  3rd Qu.:1.044   3rd Qu.:4.690       3rd Qu.: 7.937       3rd Qu.:1.413  
##  Max.   :1.269   Max.   :8.544       Max.   :17.029       Max.   :1.416  
##        pH          sulphates        alcohol         quality     
##  Min.   :1.934   Min.   :1.153   Min.   :3.066   Min.   :2.000  
##  1st Qu.:2.052   1st Qu.:1.245   1st Qu.:3.240   1st Qu.:2.449  
##  Median :2.076   Median :1.273   Median :3.347   Median :2.646  
##  Mean   :2.076   Mean   :1.286   Mean   :3.376   Mean   :2.571  
##  3rd Qu.:2.098   3rd Qu.:1.315   3rd Qu.:3.479   3rd Qu.:2.646  
##  Max.   :2.238   Max.   :1.732   Max.   :3.987   Max.   :3.000
summary(sqrtdfww)
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar 
##  Min.   :2.191   Min.   :1.039    Min.   :1.000   Min.   :1.265  
##  1st Qu.:2.702   1st Qu.:1.100    1st Qu.:1.127   1st Qu.:1.643  
##  Median :2.793   Median :1.122    Median :1.149   Median :2.490  
##  Mean   :2.799   Mean   :1.130    Mean   :1.154   Mean   :2.561  
##  3rd Qu.:2.881   3rd Qu.:1.149    3rd Qu.:1.179   3rd Qu.:3.302  
##  Max.   :3.899   Max.   :1.449    Max.   :1.631   Max.   :8.173  
##    chlorides     free.sulfur.dioxide total.sulfur.dioxide    density     
##  Min.   :1.004   Min.   : 1.732      Min.   : 3.162       Min.   :1.410  
##  1st Qu.:1.018   1st Qu.: 4.899      1st Qu.:10.440       1st Qu.:1.411  
##  Median :1.021   Median : 5.916      Median :11.619       Median :1.412  
##  Mean   :1.023   Mean   : 5.859      Mean   :11.662       Mean   :1.412  
##  3rd Qu.:1.025   3rd Qu.: 6.856      3rd Qu.:12.961       3rd Qu.:1.413  
##  Max.   :1.160   Max.   :17.029      Max.   :21.000       Max.   :1.428  
##        pH          sulphates        alcohol         quality     
##  Min.   :1.929   Min.   :1.105   Min.   :3.000   Min.   :2.000  
##  1st Qu.:2.022   1st Qu.:1.187   1st Qu.:3.240   1st Qu.:2.449  
##  Median :2.045   Median :1.212   Median :3.376   Median :2.646  
##  Mean   :2.046   Mean   :1.220   Mean   :3.389   Mean   :2.617  
##  3rd Qu.:2.069   3rd Qu.:1.245   3rd Qu.:3.521   3rd Qu.:2.646  
##  Max.   :2.195   Max.   :1.442   Max.   :3.899   Max.   :3.162
#correlations
signif(cor(sqrtdfwr[,colnames(wr)]),3)
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity               1.0000          -0.2600      0.6680
## volatile.acidity           -0.2600           1.0000     -0.5650
## citric.acid                 0.6680          -0.5650      1.0000
## residual.sugar              0.1380           0.0135      0.1560
## chlorides                   0.1070           0.0670      0.1960
## free.sulfur.dioxide        -0.1690           0.0031     -0.0711
## total.sulfur.dioxide       -0.1160           0.0822      0.0240
## density                     0.6720           0.0261      0.3620
## pH                         -0.6950           0.2340     -0.5440
## sulphates                   0.1890          -0.2730      0.3170
## alcohol                    -0.0756          -0.2080      0.1050
## quality                     0.1190          -0.3950      0.2240
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                0.1380   0.10700            -0.16900
## volatile.acidity             0.0135   0.06700             0.00310
## citric.acid                  0.1560   0.19600            -0.07110
## residual.sugar               1.0000   0.05510             0.13900
## chlorides                    0.0551   1.00000            -0.00108
## free.sulfur.dioxide          0.1390  -0.00108             1.00000
## total.sulfur.dioxide         0.1810   0.05730             0.73700
## density                      0.3830   0.21000            -0.03300
## pH                          -0.0888  -0.26900             0.08490
## sulphates                    0.0104   0.35500             0.05490
## alcohol                      0.0606  -0.22900            -0.07650
## quality                      0.0159  -0.13100            -0.04650
##                      total.sulfur.dioxide density      pH sulphates
## fixed.acidity                     -0.1160  0.6720 -0.6950    0.1890
## volatile.acidity                   0.0822  0.0261  0.2340   -0.2730
## citric.acid                        0.0240  0.3620 -0.5440    0.3170
## residual.sugar                     0.1810  0.3830 -0.0888    0.0104
## chlorides                          0.0573  0.2100 -0.2690    0.3550
## free.sulfur.dioxide                0.7370 -0.0330  0.0849    0.0549
## total.sulfur.dioxide               1.0000  0.0894 -0.0412    0.0490
## density                            0.0894  1.0000 -0.3410    0.1530
## pH                                -0.0412 -0.3410  1.0000   -0.1900
## sulphates                          0.0490  0.1530 -0.1900    1.0000
## alcohol                           -0.2270 -0.4940  0.2040    0.1050
## quality                           -0.1780 -0.1710 -0.0590    0.2650
##                      alcohol quality
## fixed.acidity        -0.0756  0.1190
## volatile.acidity     -0.2080 -0.3950
## citric.acid           0.1050  0.2240
## residual.sugar        0.0606  0.0159
## chlorides            -0.2290 -0.1310
## free.sulfur.dioxide  -0.0765 -0.0465
## total.sulfur.dioxide -0.2270 -0.1780
## density              -0.4940 -0.1710
## pH                    0.2040 -0.0590
## sulphates             0.1050  0.2650
## alcohol               1.0000  0.4680
## quality               0.4680  1.0000
signif(cor(sqrtdfww[,colnames(ww)]),3)
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity               1.0000          -0.0269    0.297000
## volatile.acidity           -0.0269           1.0000   -0.161000
## citric.acid                 0.2970          -0.1610    1.000000
## residual.sugar              0.0897           0.0763    0.083700
## chlorides                   0.0285           0.0694    0.111000
## free.sulfur.dioxide        -0.0484          -0.1070    0.094600
## total.sulfur.dioxide        0.0893           0.0827    0.119000
## density                     0.2710           0.0261    0.148000
## pH                         -0.4310          -0.0333   -0.165000
## sulphates                  -0.0164          -0.0367    0.064900
## alcohol                    -0.1230           0.0628   -0.072600
## quality                    -0.1130          -0.2020    0.000202
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                0.0897    0.0285            -0.04840
## volatile.acidity             0.0763    0.0694            -0.10700
## citric.acid                  0.0837    0.1110             0.09460
## residual.sugar               1.0000    0.0875             0.32800
## chlorides                    0.0875    1.0000             0.10100
## free.sulfur.dioxide          0.3280    0.1010             1.00000
## total.sulfur.dioxide         0.4170    0.2040             0.62800
## density                      0.8160    0.2630             0.29900
## pH                          -0.1930   -0.0905             0.00855
## sulphates                   -0.0316    0.0202             0.06040
## alcohol                     -0.4470   -0.3680            -0.24800
## quality                     -0.0861   -0.2120             0.05440
##                      total.sulfur.dioxide density       pH sulphates
## fixed.acidity                     0.08930  0.2710 -0.43100   -0.0164
## volatile.acidity                  0.08270  0.0261 -0.03330   -0.0367
## citric.acid                       0.11900  0.1480 -0.16500    0.0649
## residual.sugar                    0.41700  0.8160 -0.19300   -0.0316
## chlorides                         0.20400  0.2630 -0.09050    0.0202
## free.sulfur.dioxide               0.62800  0.2990  0.00855    0.0604
## total.sulfur.dioxide              1.00000  0.5260  0.00955    0.1380
## density                           0.52600  1.0000 -0.09420    0.0784
## pH                                0.00955 -0.0942  1.00000    0.1570
## sulphates                         0.13800  0.0784  0.15700    1.0000
## alcohol                          -0.44600 -0.7830  0.12500   -0.0214
## quality                          -0.15000 -0.3030  0.09780    0.0518
##                      alcohol   quality
## fixed.acidity        -0.1230 -0.113000
## volatile.acidity      0.0628 -0.202000
## citric.acid          -0.0726  0.000202
## residual.sugar       -0.4470 -0.086100
## chlorides            -0.3680 -0.212000
## free.sulfur.dioxide  -0.2480  0.054400
## total.sulfur.dioxide -0.4460 -0.150000
## density              -0.7830 -0.303000
## pH                    0.1250  0.097800
## sulphates            -0.0214  0.051800
## alcohol               1.0000  0.429000
## quality               0.4290  1.000000
mwr<-lm(quality~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+alcohol,sqrtdfwr)
summary(mwr)
## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol, data = sqrtdfwr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58663 -0.06835 -0.00742  0.08739  0.36943 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           17.656340  17.363964   1.017   0.3094    
## fixed.acidity          0.038927   0.032800   1.187   0.2355    
## volatile.acidity      -0.544648   0.059954  -9.084  < 2e-16 ***
## citric.acid           -0.111235   0.065160  -1.707   0.0880 .  
## residual.sugar         0.013494   0.014351   0.940   0.3472    
## chlorides             -0.767905   0.178518  -4.302 1.80e-05 ***
## free.sulfur.dioxide    0.009786   0.004049   2.417   0.0158 *  
## total.sulfur.dioxide  -0.009285   0.002331  -3.984 7.09e-05 ***
## density              -10.477681  12.461518  -0.841   0.4006    
## pH                    -0.308713   0.159589  -1.934   0.0532 .  
## sulphates              0.497197   0.060570   8.209 4.58e-16 ***
## alcohol                0.354957   0.036287   9.782  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1264 on 1587 degrees of freedom
## Multiple R-squared:  0.3557, Adjusted R-squared:  0.3512 
## F-statistic: 79.63 on 11 and 1587 DF,  p-value: < 2.2e-16
mww<-lm(quality~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+alcohol,sqrtdfww)
summary(mww)
## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol, data = sqrtdfww)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.76917 -0.09107 -0.00360  0.09041  0.69875 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           78.120546  11.473447   6.809 1.10e-11 ***
## fixed.acidity          0.031764   0.020585   1.543    0.123    
## volatile.acidity      -0.845340   0.050416 -16.767  < 2e-16 ***
## citric.acid            0.019880   0.043267   0.459    0.646    
## residual.sugar         0.065247   0.006379  10.228  < 2e-16 ***
## chlorides             -0.228175   0.217964  -1.047    0.295    
## free.sulfur.dioxide    0.014719   0.001979   7.438 1.20e-13 ***
## total.sulfur.dioxide  -0.003540   0.001679  -2.109    0.035 *  
## density              -54.410476   8.197226  -6.638 3.53e-11 ***
## pH                     0.385046   0.076778   5.015 5.49e-07 ***
## sulphates              0.272193   0.047097   5.779 7.96e-09 ***
## alcohol                0.316998   0.027597  11.487  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1434 on 4886 degrees of freedom
## Multiple R-squared:  0.2851, Adjusted R-squared:  0.2835 
## F-statistic: 177.1 on 11 and 4886 DF,  p-value: < 2.2e-16
#Pairs of sqrt transformed
pairs(sqrtdfwr);

pairs(sqrtdfww);

The linear regression summaries for the untransformed, log-transformed and sqrt-transformed data show almost the same R^2, while the RSE is lowest for the log-transformed dataset. Note that the RSE is measured in the units of the (transformed) quality, so it is not directly comparable across transformations; with that caveat, we proceed with the log-transformed dataset. Based on the correlations with quality, the following attributes can be considered potential predictors for red wine, in decreasing order of absolute correlation: 1. alcohol 2. volatile.acidity 3. sulphates 4. citric.acid

For white wine, the corresponding attributes are: 1. volatile.acidity 2. chlorides 3. alcohol
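When the response itself is transformed, the residual standard error reported by `summary()` is expressed on the transformed scale. A fairer comparison back-transforms the fitted values and measures the error on the original scale. The sketch below illustrates this on simulated data; the variables `x` and `y`, the coefficients and the noise level are assumptions chosen for illustration only.

```r
set.seed(42)
n <- 1000
x <- runif(n, 8, 14)
# Hypothetical response, linear in x, always well above -1.
y <- 2 + 0.35 * x + rnorm(n, sd = 0.7)
toy <- data.frame(x = x, y = y)

fit_raw  <- lm(y ~ x, data = toy)
fit_log  <- lm(log(y + 1) ~ x, data = toy)
fit_sqrt <- lm(sqrt(y + 1) ~ x, data = toy)

# RSE as reported by summary(): each value is in the units of its own response.
rse <- c(raw = sigma(fit_raw), log = sigma(fit_log), sqrt = sigma(fit_sqrt))

# RMSE after back-transforming the fitted values to the original scale of y.
rmse <- function(pred) sqrt(mean((toy$y - pred)^2))
rmse_orig <- c(raw  = rmse(fitted(fit_raw)),
               log  = rmse(exp(fitted(fit_log)) - 1),
               sqrt = rmse(fitted(fit_sqrt)^2 - 1))

print(round(rse, 4))        # very different magnitudes
print(round(rmse_orig, 4))  # comparable once on the same scale
```

In this simulation the log model has by far the smallest RSE simply because log(y + 1) takes much smaller values than y, while the errors on the original scale are nearly identical.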

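Regarding pairwise relationships among the predictors, we see a strong negative correlation between pH and fixed.acidity (about -0.70 for red wine and -0.44 for white wine on the log scale). Similarly, there is a strong positive correlation between total.sulfur.dioxide and free.sulfur.dioxide (about 0.78 for red and 0.63 for white).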
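Strongly correlated pairs can also be pulled out of a correlation matrix programmatically rather than read off the printed table. The sketch below uses simulated data; the data frame `toy`, the simulated relationship between its columns and the cutoff of 0.6 are all assumptions for illustration.

```r
set.seed(7)
n <- 300
# Simulated predictors; the names mimic the wine data but the values are made up.
fixed.acidity <- rnorm(n, mean = 8, sd = 1.5)
toy <- data.frame(
  fixed.acidity = fixed.acidity,
  pH            = 3.3 - 0.08 * (fixed.acidity - 8) + rnorm(n, sd = 0.08),
  alcohol       = rnorm(n, mean = 10.5, sd = 1)
)

cm <- cor(toy)
# Keep each pair once (upper triangle) and filter by absolute correlation.
idx <- which(upper.tri(cm) & abs(cm) > 0.6, arr.ind = TRUE)
strong <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                     var2 = colnames(cm)[idx[, "col"]],
                     r    = cm[idx])
print(strong)
```

Applied to `cor(logdfwr)` or `cor(logdfww)`, the same filter would list the predictor pairs worth keeping in mind when interpreting regression coefficients.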

To illustrate the relationship between quality and the selected predictors, we draw a scatter plot with a fitted regression line for each of them.

#for red wine we plot quality against alcohol, sulphates and volatile.acidity
ggplot(logdfwr,aes(x=quality,y=alcohol)) +  geom_point() + geom_smooth(method = "lm", se = FALSE)

ggplot(logdfwr,aes(x=quality,y=sulphates)) +  geom_point() + geom_smooth(method = "lm", se = FALSE)

ggplot(logdfwr,aes(x=quality,y=volatile.acidity)) +  geom_point() + geom_smooth(method = "lm", se = FALSE)

#for white wine we plot quality against volatile.acidity, alcohol and chlorides
ggplot(logdfww,aes(x=quality,y=volatile.acidity)) +  geom_point() + geom_smooth(method = "lm", se = FALSE)

ggplot(logdfww,aes(x=quality,y=alcohol)) +  geom_point() + geom_smooth(method = "lm", se = FALSE)

ggplot(logdfww,aes(x=quality,y=chlorides)) +  geom_point() + geom_smooth(method = "lm", se = FALSE)

Sub-problem 2: choose optimal models by exhaustive, forward and backward selection (20 points)

Use regsubsets from library leaps to choose optimal set of variables for modeling wine quality for red and white wine (separately), describe differences and similarities between attributes deemed important in each case.
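As a minimal sketch of the workflow, the three selection methods can be run in a loop and compared by the model size that minimizes BIC. The example uses the built-in `mtcars` data as a stand-in (with `mpg` playing the role of quality), and the choice of BIC as the criterion is an assumption, not something the task prescribes. `nvmax` is set to the number of predictors because `regsubsets` considers subsets only up to size 8 by default.

```r
library(leaps)

# mtcars is a stand-in dataset shipped with R; mpg plays the role of quality.
p <- ncol(mtcars) - 1  # number of candidate predictors
methods <- c("exhaustive", "forward", "backward")

# Size of the BIC-optimal model under each selection method.
best_size <- sapply(methods, function(m) {
  fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = p, method = m)
  which.min(summary(fit)$bic)
})
print(best_size)

# Variables included in the BIC-optimal model from exhaustive search.
fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = p, method = "exhaustive")
s <- summary(fit)
best_vars <- names(which(s$which[which.min(s$bic), -1]))
print(best_vars)
```

The same loop could be applied to the red and white wine data with 11 predictors each, so that all model sizes from 1 to 11 are evaluated.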

#Redwine
summary(regsubsets(quality ~ .,logdfwr,method="exhaustive"))
## Subset selection object
## Call: regsubsets.formula(quality ~ ., logdfwr, method = "exhaustive")
## 11 Variables  (and intercept)
##                      Forced in Forced out
## fixed.acidity            FALSE      FALSE
## volatile.acidity         FALSE      FALSE
## citric.acid              FALSE      FALSE
## residual.sugar           FALSE      FALSE
## chlorides                FALSE      FALSE
## free.sulfur.dioxide      FALSE      FALSE
## total.sulfur.dioxide     FALSE      FALSE
## density                  FALSE      FALSE
## pH                       FALSE      FALSE
## sulphates                FALSE      FALSE
## alcohol                  FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          fixed.acidity volatile.acidity citric.acid residual.sugar
## 1  ( 1 ) " "           " "              " "         " "           
## 2  ( 1 ) " "           "*"              " "         " "           
## 3  ( 1 ) " "           "*"              " "         " "           
## 4  ( 1 ) " "           "*"              " "         " "           
## 5  ( 1 ) " "           "*"              " "         " "           
## 6  ( 1 ) " "           "*"              " "         " "           
## 7  ( 1 ) " "           "*"              " "         " "           
## 8  ( 1 ) " "           "*"              "*"         " "           
##          chlorides free.sulfur.dioxide total.sulfur.dioxide density pH 
## 1  ( 1 ) " "       " "                 " "                  " "     " "
## 2  ( 1 ) " "       " "                 " "                  " "     " "
## 3  ( 1 ) " "       " "                 " "                  " "     " "
## 4  ( 1 ) "*"       " "                 " "                  " "     " "
## 5  ( 1 ) "*"       " "                 " "                  " "     "*"
## 6  ( 1 ) "*"       " "                 "*"                  " "     "*"
## 7  ( 1 ) "*"       "*"                 "*"                  " "     "*"
## 8  ( 1 ) "*"       "*"                 "*"                  " "     "*"
##          sulphates alcohol
## 1  ( 1 ) " "       "*"    
## 2  ( 1 ) " "       "*"    
## 3  ( 1 ) "*"       "*"    
## 4  ( 1 ) "*"       "*"    
## 5  ( 1 ) "*"       "*"    
## 6  ( 1 ) "*"       "*"    
## 7  ( 1 ) "*"       "*"    
## 8  ( 1 ) "*"       "*"
summary(regsubsets(quality ~ .,logdfwr,method="backward"))
## Subset selection object
## Call: regsubsets.formula(quality ~ ., logdfwr, method = "backward")
## 11 Variables  (and intercept)
##                      Forced in Forced out
## fixed.acidity            FALSE      FALSE
## volatile.acidity         FALSE      FALSE
## citric.acid              FALSE      FALSE
## residual.sugar           FALSE      FALSE
## chlorides                FALSE      FALSE
## free.sulfur.dioxide      FALSE      FALSE
## total.sulfur.dioxide     FALSE      FALSE
## density                  FALSE      FALSE
## pH                       FALSE      FALSE
## sulphates                FALSE      FALSE
## alcohol                  FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: backward
##          fixed.acidity volatile.acidity citric.acid residual.sugar
## 1  ( 1 ) " "           " "              " "         " "           
## 2  ( 1 ) " "           "*"              " "         " "           
## 3  ( 1 ) " "           "*"              " "         " "           
## 4  ( 1 ) " "           "*"              " "         " "           
## 5  ( 1 ) " "           "*"              " "         " "           
## 6  ( 1 ) " "           "*"              " "         " "           
## 7  ( 1 ) " "           "*"              " "         " "           
## 8  ( 1 ) " "           "*"              "*"         " "           
##          chlorides free.sulfur.dioxide total.sulfur.dioxide density pH 
## 1  ( 1 ) " "       " "                 " "                  " "     " "
## 2  ( 1 ) " "       " "                 " "                  " "     " "
## 3  ( 1 ) " "       " "                 " "                  " "     " "
## 4  ( 1 ) "*"       " "                 " "                  " "     " "
## 5  ( 1 ) "*"       " "                 " "                  " "     "*"
## 6  ( 1 ) "*"       " "                 "*"                  " "     "*"
## 7  ( 1 ) "*"       "*"                 "*"                  " "     "*"
## 8  ( 1 ) "*"       "*"                 "*"                  " "     "*"
##          sulphates alcohol
## 1  ( 1 ) " "       "*"    
## 2  ( 1 ) " "       "*"    
## 3  ( 1 ) "*"       "*"    
## 4  ( 1 ) "*"       "*"    
## 5  ( 1 ) "*"       "*"    
## 6  ( 1 ) "*"       "*"    
## 7  ( 1 ) "*"       "*"    
## 8  ( 1 ) "*"       "*"
summary(regsubsets(quality ~ . ,logdfwr,method="forward"))
## Subset selection object
## Call: regsubsets.formula(quality ~ ., logdfwr, method = "forward")
## 11 Variables  (and intercept)
##                      Forced in Forced out
## fixed.acidity            FALSE      FALSE
## volatile.acidity         FALSE      FALSE
## citric.acid              FALSE      FALSE
## residual.sugar           FALSE      FALSE
## chlorides                FALSE      FALSE
## free.sulfur.dioxide      FALSE      FALSE
## total.sulfur.dioxide     FALSE      FALSE
## density                  FALSE      FALSE
## pH                       FALSE      FALSE
## sulphates                FALSE      FALSE
## alcohol                  FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
##          fixed.acidity volatile.acidity citric.acid residual.sugar
## 1  ( 1 ) " "           " "              " "         " "           
## 2  ( 1 ) " "           "*"              " "         " "           
## 3  ( 1 ) " "           "*"              " "         " "           
## 4  ( 1 ) " "           "*"              " "         " "           
## 5  ( 1 ) " "           "*"              " "         " "           
## 6  ( 1 ) " "           "*"              " "         " "           
## 7  ( 1 ) " "           "*"              " "         " "           
## 8  ( 1 ) " "           "*"              "*"         " "           
##          chlorides free.sulfur.dioxide total.sulfur.dioxide density pH 
## 1  ( 1 ) " "       " "                 " "                  " "     " "
## 2  ( 1 ) " "       " "                 " "                  " "     " "
## 3  ( 1 ) " "       " "                 " "                  " "     " "
## 4  ( 1 ) "*"       " "                 " "                  " "     " "
## 5  ( 1 ) "*"       " "                 " "                  " "     "*"
## 6  ( 1 ) "*"       " "                 "*"                  " "     "*"
## 7  ( 1 ) "*"       "*"                 "*"                  " "     "*"
## 8  ( 1 ) "*"       "*"                 "*"                  " "     "*"
##          sulphates alcohol
## 1  ( 1 ) " "       "*"    
## 2  ( 1 ) " "       "*"    
## 3  ( 1 ) "*"       "*"    
## 4  ( 1 ) "*"       "*"    
## 5  ( 1 ) "*"       "*"    
## 6  ( 1 ) "*"       "*"    
## 7  ( 1 ) "*"       "*"    
## 8  ( 1 ) "*"       "*"
summary(regsubsets(quality ~ .,logdfwr,method="seqrep"))
## Subset selection object
## Call: regsubsets.formula(quality ~ ., logdfwr, method = "seqrep")
## 11 Variables  (and intercept)
##                      Forced in Forced out
## fixed.acidity            FALSE      FALSE
## volatile.acidity         FALSE      FALSE
## citric.acid              FALSE      FALSE
## residual.sugar           FALSE      FALSE
## chlorides                FALSE      FALSE
## free.sulfur.dioxide      FALSE      FALSE
## total.sulfur.dioxide     FALSE      FALSE
## density                  FALSE      FALSE
## pH                       FALSE      FALSE
## sulphates                FALSE      FALSE
## alcohol                  FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: 'sequential replacement'
##          fixed.acidity volatile.acidity citric.acid residual.sugar
## 1  ( 1 ) " "           " "              " "         " "           
## 2  ( 1 ) " "           "*"              " "         " "           
## 3  ( 1 ) " "           "*"              " "         " "           
## 4  ( 1 ) " "           "*"              " "         " "           
## 5  ( 1 ) " "           "*"              " "         " "           
## 6  ( 1 ) " "           "*"              " "         " "           
## 7  ( 1 ) " "           "*"              " "         " "           
## 8  ( 1 ) "*"           "*"              "*"         "*"           
##          chlorides free.sulfur.dioxide total.sulfur.dioxide density pH 
## 1  ( 1 ) " "       " "                 " "                  " "     " "
## 2  ( 1 ) " "       " "                 " "                  " "     " "
## 3  ( 1 ) " "       " "                 " "                  " "     " "
## 4  ( 1 ) "*"       " "                 " "                  " "     " "
## 5  ( 1 ) "*"       " "                 " "                  " "     "*"
## 6  ( 1 ) "*"       " "                 "*"                  " "     "*"
## 7  ( 1 ) "*"       "*"                 "*"                  " "     "*"
## 8  ( 1 ) "*"       "*"                 "*"                  "*"     " "
##          sulphates alcohol
## 1  ( 1 ) " "       "*"    
## 2  ( 1 ) " "       "*"    
## 3  ( 1 ) "*"       "*"    
## 4  ( 1 ) "*"       "*"    
## 5  ( 1 ) "*"       "*"    
## 6  ( 1 ) "*"       "*"    
## 7  ( 1 ) "*"       "*"    
## 8  ( 1 ) " "       " "
summary(regsubsets(quality ~ .,logdfwr,method="seqrep"))$which
##   (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 1        TRUE         FALSE            FALSE       FALSE          FALSE
## 2        TRUE         FALSE             TRUE       FALSE          FALSE
## 3        TRUE         FALSE             TRUE       FALSE          FALSE
## 4        TRUE         FALSE             TRUE       FALSE          FALSE
## 5        TRUE         FALSE             TRUE       FALSE          FALSE
## 6        TRUE         FALSE             TRUE       FALSE          FALSE
## 7        TRUE         FALSE             TRUE       FALSE          FALSE
## 8        TRUE          TRUE             TRUE        TRUE           TRUE
##   chlorides free.sulfur.dioxide total.sulfur.dioxide density    pH
## 1     FALSE               FALSE                FALSE   FALSE FALSE
## 2     FALSE               FALSE                FALSE   FALSE FALSE
## 3     FALSE               FALSE                FALSE   FALSE FALSE
## 4      TRUE               FALSE                FALSE   FALSE FALSE
## 5      TRUE               FALSE                FALSE   FALSE  TRUE
## 6      TRUE               FALSE                 TRUE   FALSE  TRUE
## 7      TRUE                TRUE                 TRUE   FALSE  TRUE
## 8      TRUE                TRUE                 TRUE    TRUE FALSE
##   sulphates alcohol
## 1     FALSE    TRUE
## 2     FALSE    TRUE
## 3      TRUE    TRUE
## 4      TRUE    TRUE
## 5      TRUE    TRUE
## 6      TRUE    TRUE
## 7      TRUE    TRUE
## 8     FALSE   FALSE
plot(regsubsets(quality ~ .,logdfwr))

summaryMetrics <-  NULL
whichAll <- list()

for ( myMthd in c("exhaustive", "backward", "forward") ) {
  rsRes <- regsubsets(quality~.,logdfwr,method=myMthd,nvmax=11)
  summRes <- summary(rsRes)
  whichAll[[myMthd]] <- summRes$which
  for ( metricName in c("rsq","rss","adjr2","cp","bic") ) {
    summaryMetrics <- rbind(summaryMetrics,
      data.frame(method=myMthd,metric=metricName,
                nvars=1:length(summRes[[metricName]]),
                value=summRes[[metricName]]))
  }
}
ggplot(summaryMetrics,aes(x=nvars,y=value,shape=method,colour=method)) + geom_path() + geom_point() + facet_wrap(~metric,scales="free") +   theme(legend.position="top")

old.par <- par(mfrow=c(2,2),ps=16,mar=c(5,7,2,1))
for ( myMthd in names(whichAll) ) {
  image(1:nrow(whichAll[[myMthd]]),
        1:ncol(whichAll[[myMthd]]),
        whichAll[[myMthd]],xlab="N(vars)",ylab="",
        xaxt="n",yaxt="n",breaks=c(-0.5,0.5,1.5),
        col=c("white","gray"),main=myMthd)
  axis(1,1:nrow(whichAll[[myMthd]]),rownames(whichAll[[myMthd]]))
  axis(2,1:ncol(whichAll[[myMthd]]),colnames(whichAll[[myMthd]]),las=2)
}
par(old.par)
#white wine
summary(regsubsets(quality ~ .,logdfww,method="exhaustive"))
## Subset selection object
## Call: regsubsets.formula(quality ~ ., logdfww, method = "exhaustive")
## 11 Variables  (and intercept)
##                      Forced in Forced out
## fixed.acidity            FALSE      FALSE
## volatile.acidity         FALSE      FALSE
## citric.acid              FALSE      FALSE
## residual.sugar           FALSE      FALSE
## chlorides                FALSE      FALSE
## free.sulfur.dioxide      FALSE      FALSE
## total.sulfur.dioxide     FALSE      FALSE
## density                  FALSE      FALSE
## pH                       FALSE      FALSE
## sulphates                FALSE      FALSE
## alcohol                  FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          fixed.acidity volatile.acidity citric.acid residual.sugar
## 1  ( 1 ) " "           " "              " "         " "           
## 2  ( 1 ) " "           "*"              " "         " "           
## 3  ( 1 ) " "           "*"              " "         " "           
## 4  ( 1 ) " "           "*"              " "         "*"           
## 5  ( 1 ) " "           "*"              " "         "*"           
## 6  ( 1 ) " "           "*"              " "         "*"           
## 7  ( 1 ) " "           "*"              " "         "*"           
## 8  ( 1 ) " "           "*"              " "         "*"           
##          chlorides free.sulfur.dioxide total.sulfur.dioxide density pH 
## 1  ( 1 ) " "       " "                 " "                  " "     " "
## 2  ( 1 ) " "       " "                 " "                  " "     " "
## 3  ( 1 ) " "       "*"                 " "                  " "     " "
## 4  ( 1 ) " "       "*"                 " "                  " "     " "
## 5  ( 1 ) " "       "*"                 " "                  "*"     " "
## 6  ( 1 ) " "       "*"                 " "                  "*"     " "
## 7  ( 1 ) " "       "*"                 " "                  "*"     "*"
## 8  ( 1 ) " "       "*"                 "*"                  "*"     "*"
##          sulphates alcohol
## 1  ( 1 ) " "       "*"    
## 2  ( 1 ) " "       "*"    
## 3  ( 1 ) " "       "*"    
## 4  ( 1 ) " "       "*"    
## 5  ( 1 ) " "       "*"    
## 6  ( 1 ) "*"       "*"    
## 7  ( 1 ) "*"       "*"    
## 8  ( 1 ) "*"       "*"
summary(regsubsets(quality ~ .,logdfww,method="backward"))
## Subset selection object
## Call: regsubsets.formula(quality ~ ., logdfww, method = "backward")
## 11 Variables  (and intercept)
##                      Forced in Forced out
## fixed.acidity            FALSE      FALSE
## volatile.acidity         FALSE      FALSE
## citric.acid              FALSE      FALSE
## residual.sugar           FALSE      FALSE
## chlorides                FALSE      FALSE
## free.sulfur.dioxide      FALSE      FALSE
## total.sulfur.dioxide     FALSE      FALSE
## density                  FALSE      FALSE
## pH                       FALSE      FALSE
## sulphates                FALSE      FALSE
## alcohol                  FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: backward
##          fixed.acidity volatile.acidity citric.acid residual.sugar
## 1  ( 1 ) " "           " "              " "         " "           
## 2  ( 1 ) " "           "*"              " "         " "           
## 3  ( 1 ) " "           "*"              " "         " "           
## 4  ( 1 ) " "           "*"              " "         "*"           
## 5  ( 1 ) " "           "*"              " "         "*"           
## 6  ( 1 ) " "           "*"              " "         "*"           
## 7  ( 1 ) " "           "*"              " "         "*"           
## 8  ( 1 ) " "           "*"              " "         "*"           
##          chlorides free.sulfur.dioxide total.sulfur.dioxide density pH 
## 1  ( 1 ) " "       " "                 " "                  " "     " "
## 2  ( 1 ) " "       " "                 " "                  " "     " "
## 3  ( 1 ) " "       "*"                 " "                  " "     " "
## 4  ( 1 ) " "       "*"                 " "                  " "     " "
## 5  ( 1 ) " "       "*"                 " "                  "*"     " "
## 6  ( 1 ) " "       "*"                 " "                  "*"     " "
## 7  ( 1 ) " "       "*"                 " "                  "*"     "*"
## 8  ( 1 ) " "       "*"                 "*"                  "*"     "*"
##          sulphates alcohol
## 1  ( 1 ) " "       "*"    
## 2  ( 1 ) " "       "*"    
## 3  ( 1 ) " "       "*"    
## 4  ( 1 ) " "       "*"    
## 5  ( 1 ) " "       "*"    
## 6  ( 1 ) "*"       "*"    
## 7  ( 1 ) "*"       "*"    
## 8  ( 1 ) "*"       "*"
summary(regsubsets(quality ~ . ,logdfww,method="forward"))
## Subset selection object
## Call: regsubsets.formula(quality ~ ., logdfww, method = "forward")
## 11 Variables  (and intercept)
##                      Forced in Forced out
## fixed.acidity            FALSE      FALSE
## volatile.acidity         FALSE      FALSE
## citric.acid              FALSE      FALSE
## residual.sugar           FALSE      FALSE
## chlorides                FALSE      FALSE
## free.sulfur.dioxide      FALSE      FALSE
## total.sulfur.dioxide     FALSE      FALSE
## density                  FALSE      FALSE
## pH                       FALSE      FALSE
## sulphates                FALSE      FALSE
## alcohol                  FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
##          fixed.acidity volatile.acidity citric.acid residual.sugar
## 1  ( 1 ) " "           " "              " "         " "           
## 2  ( 1 ) " "           "*"              " "         " "           
## 3  ( 1 ) " "           "*"              " "         " "           
## 4  ( 1 ) " "           "*"              " "         "*"           
## 5  ( 1 ) " "           "*"              " "         "*"           
## 6  ( 1 ) " "           "*"              " "         "*"           
## 7  ( 1 ) " "           "*"              " "         "*"           
## 8  ( 1 ) " "           "*"              " "         "*"           
##          chlorides free.sulfur.dioxide total.sulfur.dioxide density pH 
## 1  ( 1 ) " "       " "                 " "                  " "     " "
## 2  ( 1 ) " "       " "                 " "                  " "     " "
## 3  ( 1 ) " "       "*"                 " "                  " "     " "
## 4  ( 1 ) " "       "*"                 " "                  " "     " "
## 5  ( 1 ) " "       "*"                 " "                  "*"     " "
## 6  ( 1 ) " "       "*"                 " "                  "*"     " "
## 7  ( 1 ) " "       "*"                 " "                  "*"     "*"
## 8  ( 1 ) " "       "*"                 "*"                  "*"     "*"
##          sulphates alcohol
## 1  ( 1 ) " "       "*"    
## 2  ( 1 ) " "       "*"    
## 3  ( 1 ) " "       "*"    
## 4  ( 1 ) " "       "*"    
## 5  ( 1 ) " "       "*"    
## 6  ( 1 ) "*"       "*"    
## 7  ( 1 ) "*"       "*"    
## 8  ( 1 ) "*"       "*"
summary(regsubsets(quality ~ .,logdfww,method="seqrep"))
## Subset selection object
## Call: regsubsets.formula(quality ~ ., logdfww, method = "seqrep")
## 11 Variables  (and intercept)
##                      Forced in Forced out
## fixed.acidity            FALSE      FALSE
## volatile.acidity         FALSE      FALSE
## citric.acid              FALSE      FALSE
## residual.sugar           FALSE      FALSE
## chlorides                FALSE      FALSE
## free.sulfur.dioxide      FALSE      FALSE
## total.sulfur.dioxide     FALSE      FALSE
## density                  FALSE      FALSE
## pH                       FALSE      FALSE
## sulphates                FALSE      FALSE
## alcohol                  FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: 'sequential replacement'
##          fixed.acidity volatile.acidity citric.acid residual.sugar
## 1  ( 1 ) " "           " "              " "         " "           
## 2  ( 1 ) " "           "*"              " "         " "           
## 3  ( 1 ) " "           "*"              " "         " "           
## 4  ( 1 ) " "           "*"              " "         "*"           
## 5  ( 1 ) " "           "*"              " "         "*"           
## 6  ( 1 ) " "           "*"              " "         "*"           
## 7  ( 1 ) " "           "*"              " "         "*"           
## 8  ( 1 ) " "           "*"              " "         "*"           
##          chlorides free.sulfur.dioxide total.sulfur.dioxide density pH 
## 1  ( 1 ) " "       " "                 " "                  " "     " "
## 2  ( 1 ) " "       " "                 " "                  " "     " "
## 3  ( 1 ) " "       "*"                 " "                  " "     " "
## 4  ( 1 ) " "       "*"                 " "                  " "     " "
## 5  ( 1 ) " "       "*"                 " "                  "*"     " "
## 6  ( 1 ) " "       "*"                 " "                  "*"     " "
## 7  ( 1 ) " "       "*"                 " "                  "*"     "*"
## 8  ( 1 ) " "       "*"                 "*"                  "*"     "*"
##          sulphates alcohol
## 1  ( 1 ) " "       "*"    
## 2  ( 1 ) " "       "*"    
## 3  ( 1 ) " "       "*"    
## 4  ( 1 ) " "       "*"    
## 5  ( 1 ) " "       "*"    
## 6  ( 1 ) "*"       "*"    
## 7  ( 1 ) "*"       "*"    
## 8  ( 1 ) "*"       "*"
summary(regsubsets(quality ~ .,logdfww,method="seqrep"))$which
##   (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 1        TRUE         FALSE            FALSE       FALSE          FALSE
## 2        TRUE         FALSE             TRUE       FALSE          FALSE
## 3        TRUE         FALSE             TRUE       FALSE          FALSE
## 4        TRUE         FALSE             TRUE       FALSE           TRUE
## 5        TRUE         FALSE             TRUE       FALSE           TRUE
## 6        TRUE         FALSE             TRUE       FALSE           TRUE
## 7        TRUE         FALSE             TRUE       FALSE           TRUE
## 8        TRUE         FALSE             TRUE       FALSE           TRUE
##   chlorides free.sulfur.dioxide total.sulfur.dioxide density    pH
## 1     FALSE               FALSE                FALSE   FALSE FALSE
## 2     FALSE               FALSE                FALSE   FALSE FALSE
## 3     FALSE                TRUE                FALSE   FALSE FALSE
## 4     FALSE                TRUE                FALSE   FALSE FALSE
## 5     FALSE                TRUE                FALSE    TRUE FALSE
## 6     FALSE                TRUE                FALSE    TRUE FALSE
## 7     FALSE                TRUE                FALSE    TRUE  TRUE
## 8     FALSE                TRUE                 TRUE    TRUE  TRUE
##   sulphates alcohol
## 1     FALSE    TRUE
## 2     FALSE    TRUE
## 3     FALSE    TRUE
## 4     FALSE    TRUE
## 5     FALSE    TRUE
## 6      TRUE    TRUE
## 7      TRUE    TRUE
## 8      TRUE    TRUE
plot(regsubsets(quality ~ .,logdfww))

summaryMetrics <-  NULL
whichAll <- list()
for ( myMthd in c("exhaustive", "backward", "forward") ) {
  rsRes <- regsubsets(quality~.,logdfww,method=myMthd,nvmax=11)
  summRes <- summary(rsRes)
  whichAll[[myMthd]] <- summRes$which
  for ( metricName in c("rsq","rss","adjr2","cp","bic") ) {
    summaryMetrics <- rbind(summaryMetrics,
      data.frame(method=myMthd,metric=metricName,
                nvars=1:length(summRes[[metricName]]),
                value=summRes[[metricName]]))
  }
}
ggplot(summaryMetrics,aes(x=nvars,y=value,shape=method,colour=method)) + geom_path() + geom_point() + facet_wrap(~metric,scales="free") +   theme(legend.position="top")

old.par <- par(mfrow=c(2,2),ps=16,mar=c(5,7,2,1))
for ( myMthd in names(whichAll) ) {
  image(1:nrow(whichAll[[myMthd]]),
        1:ncol(whichAll[[myMthd]]),
        whichAll[[myMthd]],xlab="N(vars)",ylab="",
        xaxt="n",yaxt="n",breaks=c(-0.5,0.5,1.5),
        col=c("white","gray"),main=myMthd)
  axis(1,1:nrow(whichAll[[myMthd]]),rownames(whichAll[[myMthd]]))
  axis(2,1:ncol(whichAll[[myMthd]]),colnames(whichAll[[myMthd]]),las=2)
}
par(old.par)

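The three selection methods perform essentially the same at every model size from 1 to 11 variables. All metrics level off as variables are added, except in the "bic" panel, where the values tend to go up a little for the largest models.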
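Why BIC behaves differently from the other metrics can be seen in a small self-contained sketch. R^2 can only improve as variables are added to nested models, while BIC adds a penalty of log(n) per extra coefficient, so it turns upward once added variables stop reducing the residual sum of squares enough. The simulated data, the variable names x1 to x6 and the fixed order in which variables are added are assumptions for the illustration; it uses base R lm and BIC rather than regsubsets.

```r
set.seed(1)
n <- 500
X <- as.data.frame(matrix(rnorm(n * 6), n, 6))
names(X) <- paste0("x", 1:6)
# only x1..x3 carry signal; x4..x6 are pure noise
X$y <- 1.0 * X$x1 + 0.8 * X$x2 + 0.5 * X$x3 + rnorm(n)

rsq <- numeric(6)
bic <- numeric(6)
for (k in 1:6) {
  # nested models: y ~ x1, y ~ x1 + x2, ...
  form <- reformulate(paste0("x", 1:k), response = "y")
  fit <- lm(form, data = X)
  rsq[k] <- summary(fit)$r.squared
  bic[k] <- BIC(fit)
}
print(data.frame(nvars = 1:6, rsq = round(rsq, 4), bic = round(bic, 1)))

# model size preferred by BIC
best_by_bic <- which.min(bic)
print(best_by_bic)
```

In this setup R^2 never decreases with model size, whereas BIC is free to prefer a model smaller than the full one.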

Red Wine Analysis

From the diagrams above, including the which attribute of the summary, 6 variables appear to be optimal, of which 3 (alcohol, volatile.acidity, sulphates) are more important than the other 3 (chlorides, pH, total.sulfur.dioxide). We can see this in the metric curves, which flatten into a nearly straight line towards the end.

The 6 variables are:

alcohol - This was expected throughout the analysis, starting from the sub-problem above. It seems plausible, as alcohol content is one of the reasons people buy wine. In this data, higher alcohol content goes with higher quality.

sulphates - This variable is related to the total SO2 variable: sulphur dioxide is used to protect the wine, acting as an antimicrobial and antioxidant. There are many misconceptions about how much of it is optimal in wine.

volatile acidity - This comes from acetic acid created by bacteria in wine.

pH - Since acidity is directly related to pH, it is consistent that pH is also one of the selected variables.

chlorides - This variable did not come up during the analysis in the sub-problem above. According to the literature online [http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0101-20612015000100095], red wine specifically contains chloride, which gives the wine a salty taste.

White Wine Analysis

We can see a total of 6 optimal variables, of which 3 stand out as the best: alcohol, volatile.acidity and free.sulfur.dioxide. These are followed by residual.sugar, density and sulphates.
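The overlap between the red and white six-variable sets can be made explicit with set operations. The sketch below simply hard-codes the variable names read off the six-variable rows of the regsubsets tables above; the vector names red_top6 and white_top6 are introduced here for illustration only.

```r
# Variables in the best six-variable model, as read from the tables above
red_top6 <- c("alcohol", "volatile.acidity", "sulphates",
              "chlorides", "pH", "total.sulfur.dioxide")
white_top6 <- c("alcohol", "volatile.acidity", "free.sulfur.dioxide",
                "residual.sugar", "density", "sulphates")

shared     <- intersect(red_top6, white_top6)  # selected for both wines
red_only   <- setdiff(red_top6, white_top6)    # selected for red only
white_only <- setdiff(white_top6, red_top6)    # selected for white only

print(shared)
print(red_only)
print(white_only)
```

Half of each set is shared (alcohol, volatile.acidity, sulphates), while the remaining three variables differ between red and white wine.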

alcohol, sulphates and volatile.acidity are described above for red wine, and the same reasoning holds for white wine.

residual sugar - Total sulfur dioxide and the level of residual sugar are positively correlated, and the correlation is higher for white wine. White wine density and residual sugar level are also positively correlated. The alcohol level of white wine decreases as the residual sugar level increases.

free SO2 - This can be explained by its relationship to the sulphates variable.

density - This is unexpected based on the analysis done in the sub-problem above. Density seems to be correlated with residual sugar and with alcohol, which in turn determine the quality.
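Because density is described above as tracking residual sugar and alcohol, a variance inflation factor (VIF) is one common way to quantify that kind of overlap; it is not something computed in this report. The sketch below uses simulated stand-in data in which density is deliberately constructed from residual.sugar and alcohol, and a small helper vif_base (a hypothetical name) that computes VIF as 1 / (1 - R^2) from a regression of each predictor on the others.

```r
# Simulated stand-in predictors (not the wine data)
set.seed(7)
n <- 800
residual.sugar <- rnorm(n)
alcohol <- rnorm(n)
sulphates <- rnorm(n)
# density is built mostly from sugar and alcohol, plus a little noise
density <- 0.5 * residual.sugar - 0.5 * alcohol + rnorm(n, sd = 0.2)
toy <- data.frame(residual.sugar, alcohol, sulphates, density)

# VIF for each column: regress it on all other columns
vif_base <- function(df) {
  sapply(names(df), function(v) {
    fit <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

vif <- vif_base(toy)
print(round(vif, 2))
```

A large VIF for density together with a value near 1 for an unrelated predictor would support the reading that density carries little information beyond sugar and alcohol.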

Sub-problem 3: optimal model by cross-validation (25 points)

Use cross-validation (or any other resampling strategy of your choice) to estimate test error for models with different numbers of variables. Compare and comment on the number of variables deemed optimal by resampling versus those selected by regsubsets in the previous task. Compare resulting models built separately for red and white wine data.
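A small sketch of one way to assign observations to folds for k-fold cross-validation. Drawing labels with sample(1:k, n, replace=TRUE) generally gives folds of unequal size, whereas shuffling a repeated 1:k sequence keeps the fold sizes within one observation of each other. The row count n = 103 is an arbitrary example value, and fold_id is a name introduced here.

```r
set.seed(1)
n <- 103   # arbitrary example number of rows
k <- 10

# repeat 1..k until there are n labels, then shuffle them
fold_id <- sample(rep(1:k, length.out = n))
fold_sizes <- table(fold_id)
print(fold_sizes)

# rows held out in fold j would then be selected with fold_id == j
held_out_1 <- which(fold_id == 1)
print(length(held_out_1))
```

Either way of drawing folds gives a valid cross-validation; balanced folds just make the per-fold error estimates rest on similar numbers of observations.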

#red wine
predict.regsubsets <- function (object, newdata, id, ...){
  form=as.formula(object$call [[2]])
  mat=model.matrix(form,newdata)
  coefi=coef(object,id=id)
  xvars=names (coefi)
  mat[,xvars] %*% coefi
}
dfTmp <- NULL
whichSum <- array(0,dim=c(11,12,4), 
  dimnames=list(NULL,colnames(model.matrix(quality ~ .,logdfwr)),
                c("exhaustive", "backward", "forward", "seqrep")))
# Split data into training and test 30 times:
nTries <- 30
for ( iTry in 1:nTries ) {
  bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(logdfwr)))
  # Try each method available in regsubsets
  # to select the best model of each size:
  for ( jSelect in c("exhaustive", "backward", "forward", "seqrep") ) {
    rsTrain <- regsubsets(quality ~ .,logdfwr[bTrain,],method=jSelect,nvmax=11)
    # Add up variable selections:
    
    whichSum[,,jSelect] <- whichSum[,,jSelect] + summary(rsTrain)$which
    
    # Calculate test error for each set of variables
    # using predict.regsubsets implemented above:
    for ( kVarSet in 1:11 ) {
      # make predictions:
      testPred <- predict(rsTrain,logdfwr[!bTrain,],id=kVarSet)
      # calculate MSE:
      mseTest <- mean((testPred-logdfwr[!bTrain,"quality"])^2)
      # add to data.frame for future plotting:
      dfTmp <- rbind(dfTmp,data.frame(sim=iTry,sel=jSelect,vars=kVarSet,
      mse=c(mseTest,summary(rsTrain)$rss[kVarSet]/sum(bTrain)),trainTest=c("test","train")))
    }
  }
}
# plot MSEs by training/test, number of 
# variables and selection method:
ggplot(dfTmp,aes(x=factor(vars),y=mse,colour=sel)) + geom_boxplot()+facet_wrap(~trainTest)

## k-fold cross validation (10 fold)
#method for predict
#now we perform best subset selection on the full data set, and look at the coefficients of the full eleven-variable model. 
regfit.best=regsubsets(quality~.,data=logdfwr ,nvmax=12,really.big=TRUE)
coef(regfit.best ,11)
##          (Intercept)        fixed.acidity     volatile.acidity 
##           5.32737071           0.05895234          -0.27498320 
##          citric.acid       residual.sugar            chlorides 
##          -0.06366609           0.01009791          -0.31584382 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##           0.01615433          -0.02012607          -6.25107126 
##                   pH            sulphates              alcohol 
##          -0.22882822           0.26708755           0.46221731
#partitions
k=10
set.seed(1)
folds=sample(1:k,nrow(logdfwr),replace=TRUE)

cv.errors=matrix(NA,k,11, dimnames=list(NULL, paste(1:11)))
for(j in 1:k){
best.fit = regsubsets ( quality ~ . , data=logdfwr [ folds  != j , ],nvmax=12)
    for(i in 1:11){
    pred<-predict(best.fit,logdfwr[folds==j,],id=i)
    
    cv.errors[j,i]=mean( (logdfwr$quality[folds==j]-pred)^2)
    }
  }
mean.cv.errors=apply(cv.errors ,2,mean)
mean.cv.errors
##          1          2          3          4          5          6 
## 0.01199423 0.01058970 0.01026104 0.01017251 0.01017239 0.01017296 
##          7          8          9         10         11 
## 0.01011884 0.01008689 0.01007228 0.01010477 0.01008996
par(mfrow=c(1,1))
plot(mean.cv.errors ,type="b")

# white wine

dfTmp <- NULL
whichSum <- array(0,dim=c(11,12,4), 
  dimnames=list(NULL,colnames(model.matrix(quality ~ .,logdfww)),
                c("exhaustive", "backward", "forward", "seqrep")))
# Split data into training and test 30 times:
nTries <- 30
for ( iTry in 1:nTries ) {
  bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(logdfww)))
  # Try each method available in regsubsets
  # to select the best model of each size:
  for ( jSelect in c("exhaustive", "backward", "forward", "seqrep") ) {
    rsTrain <- regsubsets(quality ~ .,logdfww[bTrain,],method=jSelect,nvmax=11)
    # Add up variable selections:
    
    whichSum[,,jSelect] <- whichSum[,,jSelect] + summary(rsTrain)$which
    
    # Calculate test error for each set of variables
    # using predict.regsubsets implemented above:
    for ( kVarSet in 1:11 ) {
      # make predictions:
      testPred <- predict(rsTrain,logdfww[!bTrain,],id=kVarSet)
      # calculate MSE:
      mseTest <- mean((testPred-logdfww[!bTrain,"quality"])^2)
      # add to data.frame for future plotting:
      dfTmp <- rbind(dfTmp,data.frame(sim=iTry,sel=jSelect,vars=kVarSet,
      mse=c(mseTest,summary(rsTrain)$rss[kVarSet]/sum(bTrain)),trainTest=c("test","train")))
    }
  }
}
# plot MSEs by training/test, number of 
# variables and selection method:
ggplot(dfTmp,aes(x=factor(vars),y=mse,colour=sel)) + geom_boxplot()+facet_wrap(~trainTest)

## k-fold cross validation (10 fold)
#method for predict
#now we perform best subset selection on the full data set, and look at the coefficients of the full eleven-variable model. 
regfit.best=regsubsets(quality~.,data=logdfww ,nvmax=12,really.big=T)
coef(regfit.best ,11)
##          (Intercept)        fixed.acidity     volatile.acidity 
##         13.067606668         -0.003204615         -0.379658910 
##          citric.acid       residual.sugar            chlorides 
##          0.014138854          0.045812118         -0.154265863 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##          0.043017426         -0.018854729        -18.323800592 
##                   pH            sulphates              alcohol 
##          0.184960019          0.113450742          0.472879995
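# randomly assign each observation to one of k folds
k=10
set.seed(1)
folds=sample(1:k,nrow(logdfww),replace=TRUE)

# cv.errors[j,i]: test MSE on fold j for the best i-variable model
cv.errors=matrix(NA,k,11, dimnames=list(NULL, paste(1:11)))
for(j in 1:k){
  # select models on all folds except fold j
  best.fit=regsubsets(quality~.,data=logdfww[folds!=j,],nvmax=12)
  for(i in 1:11){
    # predict the held-out fold j with the best i-variable model
    pred<-predict(best.fit,logdfww[folds==j,],id=i)
    cv.errors[j,i]=mean((logdfww$quality[folds==j]-pred)^2)
  }
}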
mean.cv.errors=apply(cv.errors ,2,mean)
mean.cv.errors
##          1          2          3          4          5          6 
## 0.01402359 0.01310509 0.01253814 0.01234569 0.01234074 0.01227148 
##          7          8          9         10         11 
## 0.01224258 0.01220519 0.01223475 0.01223085 0.01223612
par(mfrow=c(1,1))
plot(mean.cv.errors ,type="b")

Common observations:

Training and test error behave almost identically. We can take the model size to be about 5, since the boxplots for the larger models are almost constant.

Looking at the graphs, all four selection methods yield models of very comparable performance for both wines. The error boxplots differ between the two wines, most likely because red wine has significantly fewer observations than white wine.

Red wine observations:

The error is higher on the test data than on the training data, which could mean that the process is moving towards an optimal subset of variables in the case of red wine. This is probably also related to the smaller number of observations.

Density, pH, sulphates and alcohol seem to be the main predictors, with pH and alcohol appearing more important than the other two. In sub-problem 2 above we had alcohol, volatile.acidity and sulphates as predictor variables.

White wine observations:

For white wine there is not much difference between test and training error.

As for red wine, the same variables come out as the best overall, but here density and sulphates appear more important than the others.

In sub-problem 2 we had alcohol, volatile acidity and free SO2 as the best variables, so the two approaches do not select the same variables.
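
The model size above was read off the plots by eye. One common way to make that choice explicit is the one-standard-error rule: take the smallest model whose mean cross-validation error is within one standard error of the minimum. The sketch below is only an illustration under assumptions: `pickSizeOneSE` is a hypothetical helper (not part of leaps), and the error matrix is simulated rather than taken from the wine data.

```r
# Hypothetical helper: choose a model size from a folds-by-sizes matrix of
# CV errors (same layout as cv.errors above) using the one-standard-error rule.
pickSizeOneSE <- function(cvErr) {
  meanErr <- colMeans(cvErr)
  seErr <- apply(cvErr, 2, sd) / sqrt(nrow(cvErr))
  iMin <- unname(which.min(meanErr))
  # smallest size whose mean error is within one SE of the minimum
  iOneSE <- min(which(meanErr <= meanErr[iMin] + seErr[iMin]))
  c(min = iMin, oneSE = iOneSE)
}

# Simulated stand-in for cv.errors (10 folds, 11 model sizes); NOT the wine data.
set.seed(1)
trueErr <- c(0.0140, 0.0131, 0.0125, 0.0123, rep(0.0122, 7))
simErr <- sapply(trueErr, function(m) m + rnorm(10, sd = 0.0005))
sizes <- pickSizeOneSE(simErr)
print(sizes)
```

With the objects from this report, the call would be `pickSizeOneSE(cv.errors)`, run once for red and once for white wine.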

Sub-problem 4: lasso/ridge (25 points)

Use regularized approaches (i.e. lasso and ridge) to model quality of red and white wine (separately). Compare resulting models (in terms of number of variables and their effects) to those selected in the previous two tasks (by regsubsets and resampling), comment on differences and similarities among them.

xl <- model.matrix(quality~.,logdfwr)[,-1]
head(xl)
##   fixed.acidity volatile.acidity citric.acid residual.sugar  chlorides
## 1      2.128232        0.5306283  0.00000000       1.064711 0.07325046
## 2      2.174752        0.6312718  0.00000000       1.280934 0.09349034
## 3      2.174752        0.5653138  0.03922071       1.193922 0.08801088
## 4      2.501436        0.2468601  0.44468582       1.064711 0.07232066
## 5      2.128232        0.5306283  0.00000000       1.064711 0.07325046
## 6      2.128232        0.5068176  0.00000000       1.029619 0.07232066
##   free.sulfur.dioxide total.sulfur.dioxide   density       pH sulphates
## 1            2.484907             3.555348 0.6920466 1.506297 0.4446858
## 2            3.258097             4.219508 0.6915459 1.435085 0.5187938
## 3            2.772589             4.007333 0.6916461 1.449269 0.5007753
## 4            2.890372             4.110874 0.6921467 1.425515 0.4574248
## 5            2.484907             3.555348 0.6920466 1.506297 0.4446858
## 6            2.639057             3.713572 0.6920466 1.506297 0.4446858
##    alcohol
## 1 2.341806
## 2 2.379546
## 3 2.379546
## 4 2.379546
## 5 2.341806
## 6 2.341806
yl <- logdfwr[,"quality"]
mylassoRes <- glmnet(scale(xl),yl,alpha=1)
plot(mylassoRes,label=TRUE)

mycvLassoRes <- cv.glmnet(scale(xl),yl,alpha=1)
plot(mycvLassoRes)

# finer lambda grid: 10^-6 to 1
mycvLassoRes <- cv.glmnet(scale(xl),yl,alpha=1,lambda=10^((-120:0)/20))
plot(mycvLassoRes)

# grid extended to much smaller lambda: 10^-12 to 1
mycvLassoRes <- cv.glmnet(scale(xl),yl,alpha=1,lambda=10^((-120:0)/10))
plot(mycvLassoRes)

# coarse grid over larger lambda: 10^-2 to 10
# (this last fit supplies lambda.1se and lambda.min used below)
mycvLassoRes <- cv.glmnet(scale(xl),yl,alpha=1,lambda=10^((-10:5)/5))
plot(mycvLassoRes)

predict(mylassoRes,type="coefficients",s=mycvLassoRes$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                 1
## (Intercept)           1.885053743
## fixed.acidity         .          
## volatile.acidity     -0.023272403
## citric.acid           .          
## residual.sugar        .          
## chlorides             .          
## free.sulfur.dioxide   .          
## total.sulfur.dioxide  .          
## density               .          
## pH                    .          
## sulphates             0.007467761
## alcohol               0.034614499
predict(mylassoRes,type="coefficients",s=mycvLassoRes$lambda.min)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                1
## (Intercept)           1.88505374
## fixed.acidity         .         
## volatile.acidity     -0.02693151
## citric.acid           .         
## residual.sugar        .         
## chlorides             .         
## free.sulfur.dioxide   .         
## total.sulfur.dioxide  .         
## density               .         
## pH                    .         
## sulphates             0.01175615
## alcohol               0.03918927
# Refit with glmnet's default lambda grid (same scaled predictors as above):
mylassoResScaled <- glmnet(scale(xl),yl,alpha=1)
mycvLassoResScaled <- cv.glmnet(scale(xl),yl,alpha=1)
predict(mylassoResScaled,type="coefficients",s=mycvLassoResScaled$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                 1
## (Intercept)           1.885053743
## fixed.acidity         .          
## volatile.acidity     -0.024465228
## citric.acid           .          
## residual.sugar        .          
## chlorides             .          
## free.sulfur.dioxide   .          
## total.sulfur.dioxide  .          
## density               .          
## pH                    .          
## sulphates             0.008865725
## alcohol               0.036105821

For red wine, the lasso coefficients show that 3 variables are retained as predictors (volatile.acidity, sulphates and alcohol), which exactly matches the analysis of red wine in sub-problem 2 above.
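
To see why lasso leaves only a few non-zero coefficients, it can help to look at the update it is built on. The following is a minimal sketch of lasso by coordinate descent with soft-thresholding, assuming the objective (1/(2n))·RSS + lambda·sum|beta| on standardized predictors. `softThreshold` and `lassoCD` are illustrative names, the data are simulated, and this is not how glmnet is implemented internally (glmnet adds warm starts, convergence checks and its own standardization).

```r
# Soft-thresholding: shrink z towards zero by g, and set it to exactly zero
# when |z| <= g. This is what produces the "." entries in the lasso output.
softThreshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Illustrative lasso via cyclic coordinate descent (fixed number of sweeps).
lassoCD <- function(x, y, lambda, nIter = 200) {
  n <- nrow(x)
  x <- scale(x)          # standardize predictors, as with scale(xl) above
  yc <- y - mean(y)      # centering y takes care of the intercept
  beta <- rep(0, ncol(x))
  for (iter in 1:nIter) {
    for (j in seq_along(beta)) {
      # partial residual: remove the fit of all predictors except j
      rj <- yc - x[, -j, drop = FALSE] %*% beta[-j]
      beta[j] <- softThreshold(sum(x[, j] * rj) / n, lambda) /
        (sum(x[, j]^2) / n)
    }
  }
  beta
}

# Simulated data (NOT the wine data): only the first two predictors matter.
set.seed(1)
n <- 200
x <- matrix(rnorm(n * 5), n, 5)
y <- 2 * x[, 1] - 1 * x[, 2] + rnorm(n)
betaSmall <- lassoCD(x, y, lambda = 0.01)
betaLarge <- lassoCD(x, y, lambda = 0.5)
print(round(cbind(betaSmall, betaLarge), 3))
```

With the larger penalty the noise predictors are set exactly to zero while the informative ones are only shrunk, which is the same pattern seen in the red wine coefficients above.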

myridgeRes <- glmnet(scale(xl),yl,alpha=0)
plot(myridgeRes,label=TRUE)

mycvRidgeRes <- cv.glmnet(scale(xl),yl,alpha=0)
plot(mycvRidgeRes)

mycvRidgeRes$lambda.min
## [1] 0.006177282
mycvRidgeRes$lambda.1se
## [1] 0.07615642
predict(myridgeRes,type="coefficients",s=mycvRidgeRes$lambda.min)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                 1
## (Intercept)           1.885053743
## fixed.acidity         0.010354822
## volatile.acidity     -0.029309781
## citric.acid          -0.006009319
## residual.sugar        0.003119953
## chlorides            -0.012170082
## free.sulfur.dioxide   0.008493006
## total.sulfur.dioxide -0.012602212
## density              -0.007926178
## pH                   -0.006464392
## sulphates             0.024247794
## alcohol               0.038845210
predict(myridgeRes,type="coefficients",s=mycvRidgeRes$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                 1
## (Intercept)           1.885053743
## fixed.acidity         0.005646014
## volatile.acidity     -0.019888710
## citric.acid           0.004474523
## residual.sugar        0.002102760
## chlorides            -0.008842245
## free.sulfur.dioxide   0.002426017
## total.sulfur.dioxide -0.007636518
## density              -0.008579098
## pH                   -0.002586310
## sulphates             0.016580794
## alcohol               0.026320357
mycvRidgeRes <- cv.glmnet(scale(xl),yl,alpha=0,lambda=10^((-80:80)/20))
plot(mycvRidgeRes)

mycvRidgeRes <- cv.glmnet(scale(xl),yl,alpha=0,lambda=10^((-80:80)/5))
plot(mycvRidgeRes)

myridgeResScaled <- glmnet(scale(xl),yl,alpha=0)
mycvRidgeResScaled <- cv.glmnet(scale(xl),yl,alpha=0)
predict(myridgeResScaled,type="coefficients",s=mycvRidgeResScaled$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                 1
## (Intercept)           1.885053743
## fixed.acidity         0.005646014
## volatile.acidity     -0.019888710
## citric.acid           0.004474523
## residual.sugar        0.002102760
## chlorides            -0.008842245
## free.sulfur.dioxide   0.002426017
## total.sulfur.dioxide -0.007636518
## density              -0.008579098
## pH                   -0.002586310
## sulphates             0.016580794
## alcohol               0.026320357

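For ridge regression on red wine, the result is broadly similar to lasso (alcohol, volatile.acidity and sulphates have the largest coefficients), but the fit keeps all 11 attributes, because ridge shrinks coefficients without setting any of them to zero.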
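
The fact that ridge keeps every attribute follows from its closed-form solution. Below is a small sketch on simulated data, assuming the textbook form (X'X + lambda·I)^-1 X'y on standardized predictors. `ridgeCoef` is a hypothetical helper, and its lambda is on a different scale from the one glmnet reports, so the values are not comparable with `lambda.min` or `lambda.1se` above.

```r
# Hypothetical helper: closed-form ridge coefficients on standardized predictors.
ridgeCoef <- function(x, y, lambda) {
  x <- scale(x)
  yc <- y - mean(y)
  p <- ncol(x)
  # solve (X'X + lambda * I) beta = X'y
  drop(solve(crossprod(x) + lambda * diag(p), crossprod(x, yc)))
}

# Simulated data (NOT the wine data): only the first two predictors matter.
set.seed(1)
n <- 200
x <- matrix(rnorm(n * 5), n, 5)
y <- 2 * x[, 1] - 1 * x[, 2] + rnorm(n)

lambdas <- c(0, 10, 100, 1000)
coefPath <- sapply(lambdas, function(l) ridgeCoef(x, y, l))
colnames(coefPath) <- paste0("lambda=", lambdas)
print(round(coefPath, 3))
```

As lambda grows, all coefficients shrink towards zero together, but none of them becomes exactly zero, so ridge by itself does not select variables.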

xl <- model.matrix(quality~.,logdfww)[,-1]
head(xl)
##   fixed.acidity volatile.acidity citric.acid residual.sugar  chlorides
## 1      2.079442        0.2390169   0.3074847      3.0773123 0.04401689
## 2      1.987874        0.2623643   0.2926696      0.9555114 0.04783733
## 3      2.208274        0.2468601   0.3364722      2.0668628 0.04879016
## 4      2.104134        0.2070142   0.2776317      2.2512918 0.05638033
## 5      2.104134        0.2070142   0.2776317      2.2512918 0.05638033
## 6      2.208274        0.2468601   0.3364722      2.0668628 0.04879016
##   free.sulfur.dioxide total.sulfur.dioxide   density       pH sulphates
## 1            3.828641             5.141664 0.6936471 1.386294 0.3715636
## 2            2.708050             4.890349 0.6901427 1.458615 0.3987761
## 3            3.433987             4.584967 0.6906942 1.449269 0.3646431
## 4            3.871201             5.231109 0.6909448 1.432701 0.3364722
## 5            3.871201             5.231109 0.6909448 1.432701 0.3364722
## 6            3.433987             4.584967 0.6906942 1.449269 0.3646431
##    alcohol
## 1 2.282382
## 2 2.351375
## 3 2.406945
## 4 2.388763
## 5 2.388763
## 6 2.406945
yl <- logdfww[,"quality"]
mylassoRes <- glmnet(scale(xl),yl,alpha=1)
plot(mylassoRes,label=TRUE)

mycvLassoRes <- cv.glmnet(scale(xl),yl,alpha=1)
plot(mycvLassoRes)

# finer lambda grid: 10^-6 to 1
mycvLassoRes <- cv.glmnet(scale(xl),yl,alpha=1,lambda=10^((-120:0)/20))
plot(mycvLassoRes)

# grid extended to much smaller lambda: 10^-12 to 1
mycvLassoRes <- cv.glmnet(scale(xl),yl,alpha=1,lambda=10^((-120:0)/10))
plot(mycvLassoRes)

# coarse grid over larger lambda: 10^-2 to 10
# (this last fit supplies lambda.1se and lambda.min used below)
mycvLassoRes <- cv.glmnet(scale(xl),yl,alpha=1,lambda=10^((-10:5)/5))
plot(mycvLassoRes)

predict(mylassoRes,type="coefficients",s=mycvLassoRes$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                  1
## (Intercept)           1.919907e+00
## fixed.acidity         .           
## volatile.acidity     -1.862303e-02
## citric.acid           .           
## residual.sugar        5.734653e-05
## chlorides             .           
## free.sulfur.dioxide   1.287672e-02
## total.sulfur.dioxide  .           
## density               .           
## pH                    .           
## sulphates             .           
## alcohol               4.888323e-02
predict(mylassoRes,type="coefficients",s=mycvLassoRes$lambda.min)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                  1
## (Intercept)           1.919907e+00
## fixed.acidity         .           
## volatile.acidity     -1.862303e-02
## citric.acid           .           
## residual.sugar        5.734653e-05
## chlorides             .           
## free.sulfur.dioxide   1.287672e-02
## total.sulfur.dioxide  .           
## density               .           
## pH                    .           
## sulphates             .           
## alcohol               4.888323e-02
# Refit with glmnet's default lambda grid (same scaled predictors as above):
mylassoResScaled <- glmnet(scale(xl),yl,alpha=1)
mycvLassoResScaled <- cv.glmnet(scale(xl),yl,alpha=1)
predict(mylassoResScaled,type="coefficients",s=mycvLassoResScaled$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                  1
## (Intercept)           1.9199069221
## fixed.acidity        -0.0011462225
## volatile.acidity     -0.0219211915
## citric.acid           .           
## residual.sugar        0.0040690778
## chlorides            -0.0006769184
## free.sulfur.dioxide   0.0152813087
## total.sulfur.dioxide  .           
## density               .           
## pH                    .           
## sulphates             .           
## alcohol               0.0538604059

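For white wine, lasso shows that the good predictors are volatile.acidity, residual.sugar, free.sulfur.dioxide and alcohol, which is a little different from sub-problem 2 above. The residual.sugar coefficient is very small at the lambda chosen from the coarse grid. Note also that lambda.1se and lambda.min give identical coefficients here, most likely because both fall on the same point of that coarse grid.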

myridgeRes <- glmnet(scale(xl),yl,alpha=0)
plot(myridgeRes,label=TRUE)

mycvRidgeRes <- cv.glmnet(scale(xl),yl,alpha=0)
plot(mycvRidgeRes)

mycvRidgeRes$lambda.min
## [1] 0.006014445
mycvRidgeRes$lambda.1se
## [1] 0.04243072
predict(myridgeRes,type="coefficients",s=mycvRidgeRes$lambda.min)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                 1
## (Intercept)           1.919906922
## fixed.acidity        -0.002085572
## volatile.acidity     -0.026917224
## citric.acid           0.001508390
## residual.sugar        0.026852434
## chlorides            -0.004478281
## free.sulfur.dioxide   0.021660684
## total.sulfur.dioxide -0.005914855
## density              -0.021628782
## pH                    0.005266119
## sulphates             0.007647940
## alcohol               0.048681719
predict(myridgeRes,type="coefficients",s=mycvRidgeRes$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                 1
## (Intercept)           1.919906922
## fixed.acidity        -0.003956881
## volatile.acidity     -0.020311641
## citric.acid           0.002204200
## residual.sugar        0.013886076
## chlorides            -0.007769864
## free.sulfur.dioxide   0.016841601
## total.sulfur.dioxide -0.004259868
## density              -0.015480875
## pH                    0.003731517
## sulphates             0.005528480
## alcohol               0.036359717
mycvRidgeRes <- cv.glmnet(scale(xl),yl,alpha=0,lambda=10^((-80:80)/20))
plot(mycvRidgeRes)

mycvRidgeRes <- cv.glmnet(scale(xl),yl,alpha=0,lambda=10^((-80:80)/5))
plot(mycvRidgeRes)

myridgeResScaled <- glmnet(scale(xl),yl,alpha=0)
mycvRidgeResScaled <- cv.glmnet(scale(xl),yl,alpha=0)
predict(myridgeResScaled,type="coefficients",s=mycvRidgeResScaled$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                 1
## (Intercept)           1.919906922
## fixed.acidity        -0.003956881
## volatile.acidity     -0.020311641
## citric.acid           0.002204200
## residual.sugar        0.013886076
## chlorides            -0.007769864
## free.sulfur.dioxide   0.016841601
## total.sulfur.dioxide -0.004259868
## density              -0.015480875
## pH                    0.003731517
## sulphates             0.005528480
## alcohol               0.036359717

For ridge regression on white wine, volatile.acidity, density, residual.sugar and alcohol are still good predictors, but all 11 variables stay in the model at the optimal lambda. This is expected, because ridge shrinks coefficients without setting any of them to zero. It is not the same as the lasso result, and it also does not agree with the findings of sub-problems 2 and 3 above.

Sub-problem 5: PCA (10 points)

Merge data for red and white wine (function rbind allows merging of two matrices/data frames with the same number of columns) and plot data projection to the first two principal components (e.g. biplot or similar plots). Does this representation suggest presence of clustering structure in the data? Does wine type (i.e. red or white) or quality appear to be associated with different regions occupied by observations in the plot? Please remember not to include quality attribute or wine type (red or white) indicator in your merged data, otherwise, apparent association of quality or wine type with PCA layout will be influenced by presence of those indicators in your data.

# Merge the two wines (dropping column 12, quality) and take a first look at the data
comwine<-rbind(logdfwr[,-12],logdfww[,-12])
dim(comwine)
## [1] 6497   11
head(comwine)
##   fixed.acidity volatile.acidity citric.acid residual.sugar  chlorides
## 1      2.128232        0.5306283  0.00000000       1.064711 0.07325046
## 2      2.174752        0.6312718  0.00000000       1.280934 0.09349034
## 3      2.174752        0.5653138  0.03922071       1.193922 0.08801088
## 4      2.501436        0.2468601  0.44468582       1.064711 0.07232066
## 5      2.128232        0.5306283  0.00000000       1.064711 0.07325046
## 6      2.128232        0.5068176  0.00000000       1.029619 0.07232066
##   free.sulfur.dioxide total.sulfur.dioxide   density       pH sulphates
## 1            2.484907             3.555348 0.6920466 1.506297 0.4446858
## 2            3.258097             4.219508 0.6915459 1.435085 0.5187938
## 3            2.772589             4.007333 0.6916461 1.449269 0.5007753
## 4            2.890372             4.110874 0.6921467 1.425515 0.4574248
## 5            2.484907             3.555348 0.6920466 1.506297 0.4446858
## 6            2.639057             3.713572 0.6920466 1.506297 0.4446858
##    alcohol
## 1 2.341806
## 2 2.379546
## 3 2.379546
## 4 2.379546
## 5 2.341806
## 6 2.341806
colnames(comwine)
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"
pca.out<-prcomp(comwine,scale=TRUE)
plot(pca.out)

biplot(pca.out,scale=TRUE)

PCA analysis

Looking at the biplot, we don't see any clear clustering: the data points are concentrated in the center.

PC1 places more importance on citric acid, SO2 and alcohol (quality itself is not part of this PCA, since it was excluded from the merged data).

PC2 places more importance on density, chlorides and volatile acidity. Neither component gives much importance to pH.

Wine quality appears to be associated more with PC1, although this cannot be read directly from the biplot because quality is not one of the plotted variables.

From the row numbers displayed in the biplot, the wine types are mostly spread across the plot. Looking closely, white wine seems to be determined by density, chlorides, sulphates, pH and residual sugar, whereas red wine is determined by density and chlorides, which is slightly different from the above analysis.
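
Row-number labels in a biplot of several thousand points overplot heavily, so group structure is hard to judge from them. A possible alternative is to plot the PC scores and colour them by a label that is kept outside the PCA input. The sketch below uses simulated two-group data as a stand-in; all object names in it (`simDat`, `simType`, `simPca`) are illustrative and it says nothing about what the wine data actually show.

```r
# Simulated stand-in for two wine types measured on the same 4 attributes
# (NOT the wine data).
set.seed(1)
nA <- 150
nB <- 450
grpA <- matrix(rnorm(nA * 4), nA, 4)
grpB <- matrix(rnorm(nB * 4), nB, 4)
grpB[, 1:2] <- grpB[, 1:2] + 3   # group B differs in the first two attributes
simDat <- rbind(grpA, grpB)
colnames(simDat) <- paste0("attr", 1:4)

# Group labels are kept OUTSIDE the data passed to prcomp.
simType <- rep(c("A", "B"), c(nA, nB))

simPca <- prcomp(simDat, scale. = TRUE)

# Colour the scores by group instead of labelling points with row numbers.
plot(simPca$x[, 1:2], col = ifelse(simType == "A", "red", "grey40"), pch = 20)
legend("topright", legend = c("A", "B"), col = c("red", "grey40"), pch = 20)

# Numeric check of separation: mean PC1 score by group.
pc1ByType <- tapply(simPca$x[, 1], simType, mean)
print(pc1ByType)
```

For the merged wine data the label vector could be built as `rep(c("red","white"), c(nrow(logdfwr), nrow(logdfww)))` and used with `pca.out$x` in the same way. Colouring by the quality values instead would let the association with quality be checked directly.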

Extra 10 points: model wine quality using principal components

Compute PCA representation of the data for one of the wine types (red or white) excluding wine quality attribute (of course!). Use resulting principal components (slot x in the output of prcomp) as new predictors to fit a linear model of wine quality as a function of these predictors. Compare resulting fit (in terms of MSE, r-squared, etc.) to those obtained above. Comment on the differences and similarities between these fits.

# Model wine quality using principal components for red wine.
pca.out<-prcomp(logdfwr[,-12],scale=TRUE)
summary(pca.out)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.7717 1.4190 1.2565 1.0879 1.00548 0.81051 0.74878
## Proportion of Variance 0.2854 0.1831 0.1435 0.1076 0.09191 0.05972 0.05097
## Cumulative Proportion  0.2854 0.4684 0.6119 0.7195 0.81145 0.87117 0.92214
##                            PC8     PC9    PC10    PC11
## Standard deviation     0.62261 0.50675 0.39714 0.23301
## Proportion of Variance 0.03524 0.02335 0.01434 0.00494
## Cumulative Proportion  0.95738 0.98073 0.99506 1.00000
plot(pca.out)

biplot(pca.out,scale=TRUE)

pca.out$x[1:10,]
##           PC1        PC2        PC3         PC4         PC5         PC6
## 1  -1.6605340  0.6960741  1.6430079  0.13548841  0.13615726  0.98752020
## 2  -0.7849915  2.0585544  0.7822421  0.41676005 -0.26291028 -0.79618333
## 3  -0.7358896  1.2231335  0.9462328  0.38158913 -0.04795376 -0.30556219
## 4   2.2587260  0.1480034 -0.6244777 -0.54372152  1.87183292  0.08243416
## 5  -1.6605340  0.6960741  1.6430079  0.13548841  0.13615726  0.98752020
## 6  -1.6545393  0.8790980  1.3629100  0.15620118  0.34250052  1.03589340
## 7  -1.2123682  0.9036645  0.9377109 -0.02711461  1.53163373 -0.20708960
## 8  -2.4474399 -0.4207818  0.9579970  0.47354052  1.31214704 -0.29664324
## 9  -1.0890793 -0.3629164  1.5691287  0.17030774  0.35978883  0.54137213
## 10  0.7044294  1.5579468 -1.1274291 -1.45818109 -1.87639791  0.46087576
##            PC7         PC8         PC9        PC10        PC11
## 1  -0.12735849  0.32103329 -0.25263921 -0.26807751  0.04148620
## 2  -1.18330532 -0.81089290 -0.28244916  0.02653793 -0.04237161
## 3  -0.73740570 -0.52008093 -0.07436388 -0.26507990 -0.04758413
## 4   0.34417641  0.46538243 -0.12423025 -0.24027108  0.23032457
## 5  -0.12735849  0.32103329 -0.25263921 -0.26807751  0.04148620
## 6  -0.09359533  0.36077583 -0.34316817 -0.34703528  0.01491793
## 7   0.02786749 -0.08789045 -0.18051734 -0.47862964  0.08808856
## 8  -0.23927040  0.06877128 -0.70414027  0.29283533  0.17945317
## 9   0.08029610 -0.46636010 -0.60756237  0.03419638  0.11330259
## 10  0.34061746 -0.79112784  0.94011673 -0.13286267  0.13648525
mww<-lm(logdfwr$quality ~ PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10+PC11,as.data.frame.matrix(pca.out$x))
summary(mww)
## 
## Call:
## lm(formula = logdfwr$quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + 
##     PC6 + PC7 + PC8 + PC9 + PC10 + PC11, data = as.data.frame.matrix(pca.out$x))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51177 -0.05083 -0.00499  0.06926  0.27889 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.885054   0.002489 757.209  < 2e-16 ***
## PC1          0.007937   0.001406   5.647 1.93e-08 ***
## PC2         -0.030493   0.001755 -17.376  < 2e-16 ***
## PC3         -0.040046   0.001982 -20.206  < 2e-16 ***
## PC4         -0.006537   0.002289  -2.856  0.00435 ** 
## PC5         -0.010747   0.002477  -4.339 1.52e-05 ***
## PC6          0.003919   0.003072   1.275  0.20234    
## PC7         -0.017072   0.003326  -5.133 3.20e-07 ***
## PC8         -0.012640   0.004000  -3.160  0.00161 ** 
## PC9         -0.028569   0.004914  -5.814 7.38e-09 ***
## PC10        -0.007890   0.006270  -1.258  0.20846    
## PC11        -0.005347   0.010687  -0.500  0.61696    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09955 on 1587 degrees of freedom
## Multiple R-squared:  0.3468, Adjusted R-squared:  0.3423 
## F-statistic: 76.61 on 11 and 1587 DF,  p-value: < 2.2e-16

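Comparing the red wine model built on principal components with the linear model on the log-transformed red wine data, we can see that the residual standard error and R-squared have the same values. This is expected: the 11 principal components are a rotation of the 11 scaled predictors, so a model that uses all of them spans the same space and gives the same fitted values. The coefficients do change, because they now refer to the principal components rather than to the original attributes. From the summary above, we can see that PC1 explains 29% of the variance, PC2 explains 18%, and so on.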
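
The equality of the two fits can be checked on any data set. The sketch below uses simulated data (not the wine data) with illustrative object names. It fits a linear model on the original predictors, on all principal components, and on the first two components only, then compares R-squared.

```r
# Simulated data (NOT the wine data) with two correlated predictors.
set.seed(1)
n <- 300
x <- matrix(rnorm(n * 4), n, 4)
x[, 2] <- x[, 1] + 0.5 * x[, 2]
colnames(x) <- paste0("x", 1:4)
y <- 1 + 0.5 * x[, 1] - 0.3 * x[, 3] + rnorm(n, sd = 0.5)

# Linear model on the original predictors.
fitOrig <- lm(y ~ ., data = data.frame(y = y, x))

# Linear models on the principal component scores (slot x of prcomp).
pcs <- prcomp(x, scale. = TRUE)$x
fitPcAll <- lm(y ~ ., data = data.frame(y = y, pcs))
fitPc2 <- lm(y ~ PC1 + PC2, data = data.frame(y = y, pcs))

r2 <- c(original = summary(fitOrig)$r.squared,
        allPCs = summary(fitPcAll)$r.squared,
        firstTwoPCs = summary(fitPc2)$r.squared)
print(round(r2, 4))
```

Using all components reproduces the original fit. A difference only appears once some components are dropped, which is where principal components regression starts to trade fit for a simpler model.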